Tuesday, September 19, 2017

Accurate Genomic Prediction Of Human Height

I've been posting preprints on arXiv since its beginning ~25 years ago, and I like to share research results as soon as they are written up. Science functions best through open discussion of new results! After some internal discussion, my research group decided to post our new paper on genomic prediction of human height on bioRxiv and arXiv.

But the preprint culture is nascent in many areas of science (e.g., biology), and it seems to me that some journals are not yet fully comfortable with the idea. I was pleasantly surprised to learn, just in the last day or two, that most journals now have official policies that allow online distribution of preprints prior to publication. (This has been the case in theoretical physics since before I entered the field!) Let's hope that progress continues.

The work presented below applies ideas from compressed sensing, L1 penalized regression, etc. to genomic prediction. We exploit the phase transition behavior of the LASSO algorithm to construct a good genomic predictor for human height. The results are significant for the following reasons:
We applied novel machine learning methods ("compressed sensing") to ~500k genomes from UK Biobank, resulting in an accurate predictor for human height which uses information from thousands of SNPs.

1. The actual heights of most individuals in our replication tests are within a few cm of their predicted height.

2. The variance captured by the predictor is similar to the estimated GCTA-GREML SNP heritability. Thus, our results resolve the missing heritability problem for common SNPs.

3. Out-of-sample validation on ARIC individuals (a US cohort) shows the predictor works on that population as well. The SNPs activated in the predictor overlap with previous GWAS hits from GIANT.
The scatterplot figure below gives an immediate feel for the accuracy of the predictor.
Accurate Genomic Prediction Of Human Height

Louis Lello, Steven G. Avery, Laurent Tellier, Ana I. Vazquez, Gustavo de los Campos, and Stephen D.H. Hsu

We construct genomic predictors for heritable and extremely complex human quantitative traits (height, heel bone density, and educational attainment) using modern methods in high dimensional statistics (i.e., machine learning). Replication tests show that these predictors capture, respectively, ∼40, 20, and 9 percent of total variance for the three traits. For example, predicted heights correlate ∼0.65 with actual height; actual heights of most individuals in validation samples are within a few cm of the prediction. The variance captured for height is comparable to the estimated SNP heritability from GCTA (GREML) analysis, and seems to be close to its asymptotic value (i.e., as sample size goes to infinity), suggesting that we have captured most of the heritability for the SNPs used. Thus, our results resolve the common SNP portion of the “missing heritability” problem – i.e., the gap between prediction R-squared and SNP heritability. The ∼20k activated SNPs in our height predictor reveal the genetic architecture of human height, at least for common SNPs. Our primary dataset is the UK Biobank cohort, comprised of almost 500k individual genotypes with multiple phenotypes. We also use other datasets and SNPs found in earlier GWAS for out-of-sample validation of our results.
This figure compares predicted and actual height on a validation set of 2000 individuals not used in training: males + females, actual heights (vertical axis) uncorrected for gender. For training we z-score by gender and age (to account for the Flynn-like secular increase in height across birth cohorts). We have also tested validity on a population of US individuals (i.e., out of sample; not from UKBB).
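
A minimal sketch of the z-scoring step mentioned above, assuming phenotypes are standardized within (sex, age band) groups. The function name, the age bands, and the toy heights are illustrative assumptions, not values from the paper, whose actual preprocessing may differ in detail.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_groups(records):
    """Standardize heights within each (sex, age_band) group.

    `records` is a list of (sex, age_band, height_cm) tuples.
    Returns one z-score per record, in the same order.
    """
    groups = defaultdict(list)
    for sex, age_band, height in records:
        groups[(sex, age_band)].append(height)

    # Per-group mean and (population) standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    z_scores = []
    for sex, age_band, height in records:
        mu, sigma = stats[(sex, age_band)]
        # Guard against groups with no spread (e.g. a single member).
        z_scores.append((height - mu) / sigma if sigma > 0 else 0.0)
    return z_scores


# Made-up toy data: three women and three men in one hypothetical age band.
toy = [
    ("F", "40-49", 160.0), ("F", "40-49", 166.0), ("F", "40-49", 163.0),
    ("M", "40-49", 174.0), ("M", "40-49", 180.0), ("M", "40-49", 177.0),
]
print(zscore_within_groups(toy))
```

After this step a woman and a man who sit equally far below their own group's mean get the same z-score, so the training target no longer carries the mean difference between groups.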

This figure illustrates the phase transition behavior at fixed sample size n and varying penalization lambda.

These are the SNPs activated in the predictor -- about 20k in total, uniformly distributed across all chromosomes; vertical axis is effect size of minor allele:

The big picture implication is that heritable complex traits controlled by thousands of genetic loci can, with enough data and analysis, be predicted from DNA. I expect that with good genotype | phenotype data from a million individuals we could achieve similar success with cognitive ability. We've also analyzed the sample size requirements for disease risk prediction, and they are similar (i.e., ~100 times sparsity of the effects vector; so ~100k cases + controls for a condition affected by ~1000 loci).

Note Added: Further comments in response to various questions about the paper.

1) We have tested the predictor on other ethnic groups and there is an (expected) decrease in correlation that is roughly proportional to the "genetic distance" between the test population and the white/British training population. This is likely due to different LD structure (SNP correlations) in different populations. A SNP which tags the true causal genetic variation in the Euro population may not be a good tag in, e.g., the Chinese population. We may report more on this in the future. Note that, despite the reduction in power, our predictor still captures more height variance than any other existing model for S. Asians, Chinese, Africans, etc.

2) We did not explore the biology of the activated SNPs because that is not our expertise. GWAS hits found by SSGAC, GIANT, etc. have already been connected to biological processes such as neuronal growth, bone development, etc. Plenty of follow-up work remains to be done on the SNPs we discovered.

3) Our initial reduction of candidate SNPs to the top 50k or 100k is simply to save computational resources. The L1 algorithms can handle much larger values of p, but keeping all of those SNPs in the calculation is extremely expensive in CPU time, memory, etc. We tested computational cost vs benefit in improved prediction from including more (>100k) candidate SNPs in the initial cut but found it unfavorable. (Note, we also had a reasonable prior that ~10k SNPs would capture most of the predictive power.)

4) We will have more to say about nonlinear effects, additional out-of-sample tests, other phenotypes, etc. in future work.

5) Perhaps most importantly, we have a useful theoretical framework (compressed sensing) within which to think about complex trait prediction. We can make quantitative estimates for the sample size required to "solve" a particular trait.

I leave you with some remarks from Francis Crick:
Crick had to adjust from the "elegance and deep simplicity" of physics to the "elaborate chemical mechanisms that natural selection had evolved over billions of years." He described this transition as, "almost as if one had to be born again." According to Crick, the experience of learning physics had taught him something important — hubris — and the conviction that since physics was already a success, great advances should also be possible in other sciences such as biology. Crick felt that this attitude encouraged him to be more daring than typical biologists who tended to concern themselves with the daunting problems of biology and not the past successes of physics.

Friday, September 15, 2017

Phase Transitions and Genomic Prediction of Cognitive Ability

James Thompson (University College London) recently blogged about my prediction that with sample size of order a million genotypes|phenotypes, one could construct a good genomic predictor for cognitive ability and identify most of the associated common SNPs.
The Hsu Boundary

... The “Hsu boundary” is Steve Hsu’s estimate that a sample size of roughly 1 million people may be required to reliably identify the genetic signals of intelligence.

... the behaviour of an optimization algorithm involving a million variables can change suddenly as the amount of data available increases. We see this behavior in the case of Compressed Sensing applied to genomes, and it allows us to predict that something interesting will happen with complex traits like cognitive ability at a sample size of the order of a million individuals.

Machine learning is now providing new methods of data analysis, and this may eventually simplify the search for the genes which underpin intelligence.
There are many comments on Thompson's blog post, some of them confused. Comments from a user "Donoho-Student" are mostly correct -- he or she seems to understand the subject. (The phase transition discussed is related to the Donoho-Tanner phase transition. More from Igor Carron.)

The chain of logic leading to this prediction has been discussed here before. The excerpt below is from a 2013 post The human genome as a compressed sensor:

Compressed sensing (see also here) is a method for efficient solution of underdetermined linear systems: y = Ax + noise , using a form of penalized regression (L1 penalization, or LASSO). In the context of genomics, y is the phenotype, A is a matrix of genotypes, x a vector of effect sizes, and the noise is due to nonlinear gene-gene interactions and the effect of the environment. (Note the figure above, which I found on the web, uses different notation than the discussion here and the paper below.)

Let p be the number of variables (i.e., genetic loci = dimensionality of x), s the sparsity (number of variables or loci with nonzero effect on the phenotype = nonzero entries in x) and n the number of measurements of the phenotype (i.e., the number of individuals in the sample = dimensionality of y). Then A is an n x p dimensional matrix. Traditional statistical thinking suggests that n > p is required to fully reconstruct the solution x (i.e., reconstruct the effect sizes of each of the loci). But recent theorems in compressed sensing show that n > C s log p is sufficient if the matrix A has the right properties (is a good compressed sensor). These theorems guarantee that the performance of a compressed sensor is nearly optimal -- within an overall constant of what is possible if an oracle were to reveal in advance which s loci out of p have nonzero effect. In fact, one expects a phase transition in the behavior of the method as n crosses a critical threshold given by the inequality. In the good phase, full recovery of x is possible.
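
To make the phase transition concrete, here is a toy sketch (not the code used in the paper) of L1-penalized regression on a random Gaussian matrix standing in for the genotype matrix A. The dimensions, the penalty `lam`, the number of sweeps, and the simple coordinate-descent solver are all illustrative assumptions, and the noiseless case y = Ax is used for simplicity.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator that appears in L1-penalized updates."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(A, y, lam, n_sweeps=300):
    """Coordinate descent for (1/2n) * ||y - A x||^2 + lam * ||x||_1."""
    n, p = A.shape
    x = np.zeros(p)
    residual = y.copy()                      # equals y - A @ x, since x = 0
    col_sq = (A ** 2).sum(axis=0) / n        # ||A_j||^2 / n for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes x[j].
            rho = A[:, j] @ residual / n + col_sq[j] * x[j]
            new_xj = soft_threshold(rho, lam) / col_sq[j]
            if new_xj != x[j]:
                residual -= A[:, j] * (new_xj - x[j])
                x[j] = new_xj
    return x


def relative_error(n, p=200, s=5, lam=0.01, seed=0):
    """Relative L2 error of the LASSO estimate for a random s-sparse x."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, p))          # random stand-in for genotypes
    x_true = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    x_true[support] = rng.choice([-1.0, 1.0], size=s)
    y = A @ x_true                           # noiseless phenotype (h2 = 1)
    x_hat = lasso_cd(A, y, lam)
    return np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)


# Vary the sample size n at fixed p and s and watch the recovery error.
for n in (10, 20, 40, 80, 160):
    print(n, round(float(relative_error(n)), 3))
```

With p and s held fixed, the relative error in the recovered x should drop sharply once n is large enough, which is the qualitative behavior described above. Real genotype matrices have correlation structure, so where the threshold sits for genomes has to be established by simulation on actual genotype data.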

In the paper below, available on arXiv, we show that

1. Matrices of human SNP genotypes are good compressed sensors and are in the universality class of random matrices. The phase behavior is controlled by scaling variables such as rho = s/n and our simulation results predict the sample size threshold for future genomic analyses.

2. In applications with real data the phase transition can be detected from the behavior of the algorithm as the amount of data n is varied. A priori knowledge of s is not required; in fact one deduces the value of s this way.

3. For heritability h2 = 0.5 and p ~ 1E06 SNPs, the value of C log p is ~ 30. For example, a trait which is controlled by s = 10k loci would require a sample size of n ~ 300k individuals to determine the (linear) genetic architecture.
For more posts on compressed sensing, L1-penalized optimization, etc. see here. I usually give a million as my rough estimate of the critical sample size for good results: s could be larger than 10k, the common SNP heritability of cognitive ability might be less than 0.5, the phenotype measurements are noisy, and a million is a nice round figure. The estimate that s ~ 10k for cognitive ability and height originates here, but is now supported by other work: see, e.g., Estimation of genetic architecture for complex traits using GWAS data.
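
The arithmetic behind these numbers is easy to write down. The sketch below takes the value C log p ~ 30 quoted above (for h2 = 0.5 and p ~ 1E06 SNPs) as given; the function name is illustrative, and the values of s above 10k are hypothetical, included only to show how the threshold scales.

```python
def critical_sample_size(s, c_log_p=30):
    """Rough sample-size threshold n ~ (C log p) * s for sparsity s.

    c_log_p = 30 is the simulation-based value quoted for h2 = 0.5 and
    p ~ 1E06 SNPs; treat it as an input, not a universal constant.
    """
    return c_log_p * s


# s = 10k is the estimate used in the text; larger s values are hypothetical.
for s in (10_000, 20_000, 30_000):
    print(f"s = {s:,d} loci -> n ~ {critical_sample_size(s):,d} individuals")
```

If s is a few times larger than 10k, the threshold moves toward the million mark; the discussion above suggests that lower SNP heritability and noisy phenotype measurements push in the same direction.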

We have recently finished analyzing height using L1-penalization and the phase transition technique on a very large data set (many hundreds of thousands of individuals). The paper has been submitted for review, and the results support the claims made above with s ~ 10k, h2 ~ 0.5 for height.

Added: Here are comments from "Donoho-Student":
Donoho-Student says:
September 14, 2017 at 8:27 pm GMT • 100 Words

The Donoho-Tanner transition describes the noise-free (h2=1) case, which has a direct analog in the geometry of polytopes.

The n = 30s result from Hsu et al. (specifically the value of the coefficient, 30, when p is the appropriate number of SNPs on an array and h2 = 0.5) is obtained via simulation using actual genome matrices, and is original to them. (There is no simple formula that gives this number.) The D-T transition had only been established in the past for certain classes of matrices, like random matrices with specific distributions. Those results cannot be immediately applied to genomes.

The estimate that s is (order of magnitude) 10k is also a key input.

I think Hsu refers to n = 1 million instead of 30 * 10k = 300k because the effective SNP heritability of IQ might be less than h2 = 0.5 — there is noise in the phenotype measurement, etc.

Donoho-Student says:
September 15, 2017 at 11:27 am GMT • 200 Words

Lasso is a common statistical method but most people who use it are not familiar with the mathematical theorems from compressed sensing. These results give performance guarantees and describe phase transition behavior, but because they are rigorous theorems they only apply to specific classes of sensor matrices, such as simple random matrices. Genomes have correlation structure, so the theorems do not directly apply to the real world case of interest, as is often true.

What the Hsu paper shows is that the exact D-T phase transition appears in the noiseless (h2 = 1) problem using genome matrices, and a smoothed version appears in the problem with realistic h2. These are new results, as is the prediction for how much data is required to cross the boundary. I don’t think most gwas people are familiar with these results. If they did understand the results they would fund/design adequately powered studies capable of solving lots of complex phenotypes, medical conditions as well as IQ, that have significant h2.

Most people who use lasso, as opposed to people who prove theorems, are not even aware of the D-T transition. Even most people who prove theorems have followed the Candes-Tao line of attack (restricted isometry property) and don’t think much about D-T. Although D eventually proved some things about the phase transition using high dimensional geometry, it was initially discovered via simulation using simple random matrices.

Wednesday, September 13, 2017

"Helicopter parents produce bubble wrapped kids"

Heterodox Academy. In my opinion these are reasonable center-left (Haidt characterizes himself as "liberal left") people whose views would have been completely acceptable on campus just 10 or 20 years ago. Today they are under attack for standing up for freedom of speech and diversity of thought.

Sunday, September 10, 2017

Bannon Unleashed


[ These embedded clips were annoyingly set to autoplay, so I have removed them. ]

Most of this short segment was edited out of the long interview shown on 60 Minutes (see video below).

Bannon denounces racism and endorses Citizenism. See also The Bannon Channel.

Paraphrasing slightly:
Economic nationalism inclusive of all races, religions, and sexual preferences -- as long as you're a citizen of our country.

The smart Democrats are trying to get the identity politics out of the party. The winning strategy will be populism -- the only question is whether it will be a left-wing populism or right-wing populism. We'll see in 2020.
This is the longer interview, with no quarter given to a dumbfounded Charlie Rose:

[ Clip removed ]

This is a 1 hour video that aired on PBS. There are amazing details about the 2016 campaign from Bannon the deep insider. If you followed the election closely you will be very interested in this interview. (In case this video is taken down you might find the content here.)

Varieties of Snowflakes

I was pleasantly surprised that New Yorker editor David Remnick and Berkeley law professor Melissa Murray continue to support the First Amendment, even if some of her students do not. Remnick gives historian Mark Bray (author of Antifa: The Anti-Fascist Handbook) a tough time about the role of violence in political movements.

After Charlottesville, the Limits of Free Speech

David Remnick speaks with the author of a new and sympathetic book about Antifa, a law professor at University of California, Berkeley, and a legal analyst for Slate, to look at how leftist protests at Berkeley, right-wing violence in Charlottesville, and open-carry laws around the country are testing the traditional liberal consensus on freedom of expression

Thursday, September 07, 2017

BENEFICIAL AI 2017 (Asilomar meeting)

AI researcher Yoshua Bengio gives a nice overview of recent progress in Deep Learning, and provides some perspective on challenges that must be overcome to achieve AGI (i.e., human-level general intelligence). I agree with Bengio that the goal is farther than the recent wave of excitement might lead one to believe.

There were many other interesting talks at the BENEFICIAL AI 2017 meeting held in Asilomar CA. (Some may remember the famous Asilomar meeting on recombinant DNA in 1975.)

Here's a panel discussion Creating Human-level AI: How and When?

If you like speculative discussion, this panel on Superintelligence should be of interest:

Tuesday, September 05, 2017

DeepMind and StarCraft II Learning Environment

This Learning Environment will enable researchers to attack the problem of building an AI that plays StarCraft II at a high level. As observed in the video, this infrastructure development required significant investment of resources by DeepMind / Alphabet. Now, researchers in academia and elsewhere have a platform from which to explore an important class of AI problems that are related to real world strategic planning. Although StarCraft is "just" a video game, it provides a rich virtual laboratory for machine learning.
StarCraft II: A New Challenge for Reinforcement Learning

This paper introduces SC2LE (StarCraft II Learning Environment), a reinforcement learning environment based on the StarCraft II game. This domain poses a new grand challenge for reinforcement learning, representing a more difficult class of problems than considered in most prior work. It is a multi-agent problem with multiple players interacting; there is imperfect information due to a partially observed map; it has a large action space involving the selection and control of hundreds of units; it has a large state space that must be observed solely from raw input feature planes; and it has delayed credit assignment requiring long-term strategies over thousands of steps. We describe the observation, action, and reward specification for the StarCraft II domain and provide an open source Python-based interface for communicating with the game engine. In addition to the main game maps, we provide a suite of mini-games focusing on different elements of StarCraft II gameplay. For the main game maps, we also provide an accompanying dataset of game replay data from human expert players. We give initial baseline results for neural networks trained from this data to predict game outcomes and player actions. Finally, we present initial baseline results for canonical deep reinforcement learning agents applied to the StarCraft II domain. On the mini-games, these agents learn to achieve a level of play that is comparable to a novice player. However, when trained on the main game, these agents are unable to make significant progress. Thus, SC2LE offers a new and challenging environment for exploring deep reinforcement learning algorithms and architectures.

Friday, September 01, 2017

Lax on vN: "He understood in an instant"

Mathematician Peter Lax (awarded National Medal of Science, Wolf and Abel prizes), interviewed about his work on the Manhattan Project. His comments on von Neumann and Feynman:
Lax: ... Von Neumann was very deeply involved in Los Alamos. He realized that computers would be needed to carry out the calculations needed. So that was, I think, his initial impulse in developing computers. Of course, he realized that computing would be important for every highly technical project, not just atomic energy. He was the most remarkable man. I’m always utterly surprised that his name is not common, household.

It is a name that should be known to every American—in fact, every person in the world, just as the name of Einstein is. I am always utterly surprised how come he’s almost totally unknown. ... All people who had met him and interacted with him realized that his brain was more powerful than anyone’s they have ever encountered. I remember Hans Bethe even said, only half in jest, that von Neumann’s brain was a new development of the human brain. Only a slight exaggeration.

... People today have a hard time to imagine how brilliant von Neumann was. If you talked to him, after three words, he took over. He understood in an instant what the problem was and had ideas. Everybody wanted to talk to him.


Kelly: I think another person that you mention is Richard Feynman?

Lax: Yes, yes, he was perhaps the most brilliant of the people there. He was also somewhat eccentric. He played the bongo drums. But everybody admired his brilliance. [ vN was a consultant and only visited Los Alamos occasionally. ]
Full transcript. See also Another species, an evolution beyond man.

Wednesday, August 30, 2017

Normies Lament

Ezra Klein talks to Angela Nagle. It's still normie normative, but Nagle has at least done some homework.

Click the link below to hear the podcast.
From 4Chan to Charlottesville: where the alt-right came from, and where it's going

Angela Nagle spent the better part of the past decade in the darkest corners of the internet, learning how online subcultures emerge and thrive on forums like 4chan and Tumblr.

The result is her fantastic new book, Kill All the Normies: Online Culture Wars From 4Chan And Tumblr to Trump and the Alt-Right, a comprehensive exploration of the origins of our current political moment.

We talk about the origins of the alt-right, and how the movement morphed from transgressive aesthetics on the internet to the violence in Charlottesville, but we also discuss PC culture on the left, demographic change in America, and the toxicity of online politics in general. Nagle is particularly interested in how the left's policing of language radicalizes its victims and creates space for alt-right groups to find eager recruits, and so we dive deep into that.


Civilization and Its Discontents by Sigmund Freud

This Is Why We Can't Have Nice Things: Mapping the Relationship between Online Trolling and Mainstream Culture by Whitney Phillips

The Net Delusion: The Dark Side of Internet Freedom by Evgeny Morozov
Actually this interview with the Irish Times (not Ezra Klein) is much better:

Friday, August 25, 2017

Job Opening in Computational Genomics

A VC-funded genomics startup I am familiar with is searching for someone to apply computational methods to complex human traits (e.g., polygenic disease risk).

The ideal candidate would be someone from Physics or CS or other quantitative discipline, interested in computational genomics and data science. Strong background in computation required. Advanced degree a plus, but not required.

Location is in NJ, just outside NYC.

Send your resume to hsurecruits@gmail.com

Sunday, August 20, 2017

Ninety-nine genetic loci influencing general cognitive function

The paper below has something like 200 authors from over 100 institutions worldwide.

Many people claimed just a few years ago (or more recently!) that results like this were impossible. Will they admit their mistake?

In Scientific Consensus on Cognitive Ability? I described the current consensus among experts as follows.
0. Intelligence is (at least crudely) measurable
1. Intelligence is highly heritable (much of the variance is determined by DNA)
2. Intelligence is highly polygenic (controlled by many genetic variants, each of small effect)
3. Intelligence is going to be deciphered at the molecular level, in the near future, by genomic studies with very large sample size
See figures below for a summary of progress over the last six years. Note 4% of total variance = 1/25 and sqrt(1/25) = 1/5, so a predictor built from these variants would correlate ~0.2 with actual cognitive ability. There is still much more variance to be discovered with larger samples, of course.
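
The conversion from variance explained to correlation used above is just a square root. A quick sketch (the function name is illustrative):

```python
import math


def correlation_from_variance_explained(r_squared):
    """Correlation between predictor and trait implied by variance explained."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("variance explained must lie in [0, 1]")
    return math.sqrt(r_squared)


# 4% of variance, the figure quoted for general cognitive function.
print(correlation_from_variance_explained(0.04))
```

The same conversion applied to the ~40 percent of variance reported for height in the newer post above gives a correlation of about 0.63, close to the ~0.65 quoted there.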
Ninety-nine independent genetic loci influencing general cognitive function include genes associated with brain health and structure (N = 280,360)

General cognitive function is a prominent human trait associated with many important life outcomes including longevity. The substantial heritability of general cognitive function is known to be polygenic, but it has had little explication in terms of the contributing genetic variants. Here, we combined cognitive and genetic data from the CHARGE and COGENT consortia, and UK Biobank (total N=280,360). We found 9,714 genome-wide significant SNPs in 99 independent loci. Most showed clear evidence of functional importance. Among many novel genes associated with general cognitive function were SGCZ, ATXN1, MAPT, AUTS2, and P2RY6. Within the novel genetic loci were variants associated with neurodegenerative disorders, neurodevelopmental disorders, physical and psychiatric illnesses, brain structure, and BMI. Gene-based analyses found 536 genes significantly associated with general cognitive function; many were highly expressed in the brain, and associated with neurogenesis and dendrite gene sets. Genetic association results predicted up to 4% of general cognitive function variance in independent samples. There was significant genetic overlap between general cognitive function and information processing speed, as well as many health variables including longevity.

Chinese Social Media Notices US Cultural Revolution

The joke below is making the rounds on Chinese social media.

See Struggles at Yale and Baizuo = Libtard.

Also circulating on Chinese social media: A Report on the Cultural Revolution in the United States.

Yes, an entire country can go crazy for a decade...
Cultural Revolution (Wikipedia): The Cultural Revolution, formally the Great Proletarian Cultural Revolution, was a sociopolitical movement that took place in China from 1966 until 1976. Set into motion by Mao Zedong, then Chairman of the Communist Party of China, its stated goal was to preserve 'true' Communist ideology in the country by purging remnants of capitalist and traditional elements from Chinese society, and to re-impose Maoist thought as the dominant ideology within the Party. ...

The movement paralyzed China politically and negatively affected the country's economy and society to a significant degree. ...

Libraries full of historical and foreign texts were destroyed; books were burned. Temples, churches, mosques, monasteries, and cemeteries were closed down ...

The Bannon Channel

Rumor has it that Bannon will start a Breitbart TV channel to rival Fox News. Given the success of YouTube- / pod-casters like Joe Rogan (5 million downloads per episode), it's plausible this could be done with very modest capex (the channel could start out as pure streaming and only go to cable later). Billionaire Robert Mercer (Renaissance Technologies) is a likely backer.

(The headline above actually appeared on the front page of the Huffington Post when it was announced that Bannon would leave the White House. It was quickly replaced with the headline White Flight -- swing state voters and 2018 midterm election voters will not forget either headline, I predict.)

Bannon's new channel can denounce Richard Spencer, Nazis, KKK, etc. but still push economic nationalism and even mild white identity slogans like "white people have rights, too" or "everyone should be treated as an individual, based on merit" ... (gee, that last one seems pretty principled... maybe they could add something catchy like the content of their character or something like that).

They can keep winning on a populist platform with just 4 messages:

1. Economic Nationalism
2. No foreign wars
3. Reform immigration
4. Stop PC excesses

Again, not really to the right of Fox, but just more consistent and less corporate GOP ("cuck") content. Tucker Carlson and Hannity could move to the new network without changing their schtick in the slightest.

Mercer would be smart to back this based entirely on financial ROI. There is massive pent-up demand: Trump got about half the popular vote, but currently no major media outlet is aligned with the views of his supporters.

Friday, August 18, 2017

Embryo Selection in China (Nature)

China’s embrace of embryo selection raises thorny questions (Nature)

Fertility centres are making a massive push to increase preimplantation genetic diagnosis in a bid to eradicate certain diseases.

... Early experiments are beginning to show how genome-editing technologies such as CRISPR might one day fix disease-causing mutations before embryos are implanted. But refining the techniques and getting regulatory approval will take years. PGD has already helped thousands of couples. And whereas the expansion of PGD around the world has generally been slow, in China, it is starting to explode.

... Genetic screening during pregnancy for chromosomal abnormalities linked to maternal age has taken off throughout the country, and many see this as a precursor to wider adoption of PGD.

Although Chinese fertility doctors were late to the game in adopting the procedure, they have been pursuing a more aggressive, comprehensive and systematic path towards its use there than anywhere else. The country’s central government, known for its long-term thinking, has over the past decade stepped up efforts to bring high-quality health care to the people, and its current 5-year plan has made reproductive medicine, including PGD, a priority ...

Comprehensive figures are difficult to come by, but estimates from leading PGD providers show that China’s use of the technique already outpaces that in the United States, and it is growing up to five times faster.

... there are concerns about the push to select for non-disease-related traits, such as intelligence or athletic ability. The ever-present spectre of eugenics lurks in the shadows. But in China, although these concerns are considered, most thoughts are focused on the benefits of the procedures.

... And the centres with licences to do PGD have created a buzz in their race to claim firsts with the technology. In 2015, CITIC-Xiangya boasted China’s first “cancer-free baby”. The boy’s parents had terminated a prior pregnancy after genetic testing showed the presence of retinoblastoma, a cancer that forms in the eyes during early development and often leads to blindness. In their next try, the couple used PGD to ensure that the gene variant that causes retinoblastoma wasn’t present. Other groups have helped couples to avoid passing on a slew of conditions: short-rib-polydactyly syndrome, Brittle-bone disease, Huntington’s disease, polycystic kidney disease and deafness, among others. ...

Joe Leigh Simpson, a medical geneticist at Florida International University in Miami, and former president of the Preimplantation Genetic Diagnosis International Society, is impressed by the quality and size of the Chinese fertility clinics. They “are superb and have gigantic units. They came out of nowhere in just 2 or 3 years,” he says. ...

People in China seem more likely to feel an obligation to bear the healthiest child possible than to protect an embryo. The Chinese appetite for using genetic technology to ensure healthy births can be seen in the rapid rise of pregnancy testing for Down’s syndrome and other chromosomal abnormalities. Since Shenzhen-based BGI introduced a test for Down’s syndrome in 2013, it has sold more than 2 million kits; half of those sales were in the past year.

... The Chinese word for eugenics, yousheng, is used explicitly as a positive in almost all conversations about PGD. Yousheng is about giving birth to children of better quality. Not smoking during pregnancy is also part of yousheng. ...
优生学 加油! ("Go, eugenics!")

Wednesday, August 16, 2017

Meet the Bot: OpenAI and Dota 2

OpenAI has created a Dota 2 bot that plays at the level of human professionals. Humans can look forward to coexistence with increasingly clever AIs in both virtual and real world settings. See also Robots taking our jobs.
OpenAI: Dota 1v1 is a complex game with hidden information. Agents must learn to plan, attack, trick, and deceive their opponents. The correlation between player skill and actions-per-minute is not strong, and in fact, our AI’s actions-per-minute are comparable to that of an average human player.

Success in Dota requires players to develop intuitions about their opponents and plan accordingly. In the above video you can see that our bot has learned — entirely via self-play — to predict where other players will move, to improvise in response to unfamiliar situations, and how to influence the other player’s allied units to help it succeed.
About the game ("Defense of the Ancients").
Wikipedia: Dota 2 is played in matches between two teams of five players, with each team occupying and defending their own separate base on the map. Each of the ten players independently controls a powerful character, known as a "hero", who all have unique abilities and styles of play. During a match, a player and their team collects experience points and items for their heroes in order to fight through the opposing team's heroes and other defenses. A team wins by being the first to destroy a large structure located in the opposing team's base, called the "Ancient".
Related: this is a nice recent interview with Demis Hassabis of DeepMind. He talks a bit about Go innovation resulting from AlphaGo.

Monday, August 14, 2017

Estimation of genetic architecture for complex traits using GWAS data

These authors extrapolate from existing data to predict the sample sizes needed to identify SNPs which explain a large portion of heritability in a variety of traits. For cognitive ability (see red curves in figure below), they predict that sample sizes of ~1 million individuals will suffice.

See also More Shock and Awe: James Lee and SSGAC in Oslo, 600 SNP hits.
Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits and implications for the future

Yan Zhang, Guanghao Qi, Ju-Hyun Park, Nilanjan Chatterjee (Johns Hopkins University)

Summary-level statistics from genome-wide association studies are now widely used to estimate heritability and co-heritability of traits using the popular linkage-disequilibrium-score (LD-score) regression method. We develop a likelihood-based approach for analyzing summary-level statistics and external LD information to estimate common variants effect-size distributions, characterized by proportion of underlying susceptibility SNPs and a flexible normal-mixture model for their effects. Analysis of summary-level results across 32 GWAS reveals that while all traits are highly polygenic, there is wide diversity in the degrees of polygenicity. The effect-size distributions for susceptibility SNPs could be adequately modeled by a single normal distribution for traits related to mental health and ability and by a mixture of two normal distributions for all other traits. Among quantitative traits, we predict the sample sizes needed to identify SNPs which explain 80% of GWAS heritability to be between 300K-500K for some of the early growth traits, between 1-2 million for some anthropometric and cholesterol traits and multiple millions for body mass index and some others. The corresponding predictions for disease traits are between 200K-400K for inflammatory bowel diseases, close to one million for a variety of adult onset chronic diseases and between 1-2 million for psychiatric diseases.

This figure shows predicted effect size distributions for a number of quantitative traits. You can see that height and intelligence are somewhat different, but not dramatically so.

Thursday, August 10, 2017

Meanwhile, down on the Farm

The Spring 2017 issue of the Stanford Medical School magazine has a special theme: Sex, Gender, and Medicine. I recommend the article excerpted below to journalists covering the Google Manifesto / James Damore firing. After reading it, they can decide for themselves whether his memo is based on established neuroscience or bro-pseudoscience.

Perhaps top Google executives will want to head down the road to Stanford for a refresher course in reality.

Stanford Neuroscience Professor Nirao Shah and Diane Halpern, past president of the American Psychological Association, would both make excellent expert witnesses in the Trial of the Century.
Two minds: The cognitive differences between men and women

... Nirao Shah decided in 1998 to study sex-based differences in the brain ... “I wanted to find and explore neural circuits that regulate specific behaviors,” says Shah, then a newly minted Caltech PhD who was beginning a postdoctoral fellowship at Columbia. So, he zeroed in on sex-associated behavioral differences in mating, parenting and aggression.

“These behaviors are essential for survival and propagation,” says Shah, MD, PhD, now a Stanford professor of psychiatry and behavioral sciences and of neurobiology. “They’re innate rather than learned — at least in animals — so the circuitry involved ought to be developmentally hard-wired into the brain. These circuits should differ depending on which sex you’re looking at.”

His plan was to learn what he could about the activity of genes tied to behaviors that differ between the sexes, then use that knowledge to help identify the neuronal circuits — clusters of nerve cells in close communication with one another — underlying those behaviors.

At the time, this was not a universally popular idea. The neuroscience community had largely considered any observed sex-associated differences in cognition and behavior in humans to be due to the effects of cultural influences. Animal researchers, for their part, seldom even bothered to use female rodents in their experiments, figuring that the cyclical variations in their reproductive hormones would introduce confounding variability into the search for fundamental neurological insights.

But over the past 15 years or so, there’s been a sea change as new technologies have generated a growing pile of evidence that there are inherent differences in how men’s and women’s brains are wired and how they work.

... There was too much data pointing to the biological basis of sex-based cognitive differences to ignore, Halpern says. For one thing, the animal-research findings resonated with sex-based differences ascribed to people. These findings continue to accrue. In a study of 34 rhesus monkeys, for example, males strongly preferred toys with wheels over plush toys, whereas females found plush toys likable. It would be tough to argue that the monkeys’ parents bought them sex-typed toys or that simian society encourages its male offspring to play more with trucks. A much more recent study established that boys and girls 9 to 17 months old — an age when children show few if any signs of recognizing either their own or other children’s sex — nonetheless show marked differences in their preference for stereotypically male versus stereotypically female toys.

Halpern and others have cataloged plenty of human behavioral differences. “These findings have all been replicated,” she says.

... “You see sex differences in spatial-visualization ability in 2- and 3-month-old infants,” Halpern says. Infant girls respond more readily to faces and begin talking earlier. Boys react earlier in infancy to experimentally induced perceptual discrepancies in their visual environment. In adulthood, women remain more oriented to faces, men to things.

All these measured differences are averages derived from pooling widely varying individual results. While statistically significant, the differences tend not to be gigantic. They are most noticeable at the extremes of a bell curve, rather than in the middle, where most people cluster. ...

See also Gender differences in preferences, choices, and outcomes: SMPY longitudinal study. These preference asymmetries are not necessarily determined by biology. They could be entirely due to societal influences. But nevertheless, they characterize the pool of human capital from which Google is trying to hire.
The recent SMPY paper below describes a group of mathematically gifted (top 1% ability) individuals who have been followed for 40 years. This is precisely the pool from which one would hope to draw STEM and technological leadership talent. There are 1037 men and 613 women in the study.

The figures show significant gender differences in life and career preferences, which affect choices and outcomes even after ability is controlled for. (Click for larger versions.) According to the results, SMPY men are more concerned with money, prestige, success, creating or inventing something with impact, etc. SMPY women prefer time and work flexibility, want to give back to the community, and are less comfortable advocating unpopular ideas. Some of these asymmetries are at the 0.5 SD level or greater. Here are three survey items with a ~ 0.4 SD or more asymmetry:

# Society should invest in my ideas because they are more important than those of other people.

# Discomforting others does not deter me from stating the facts.

# Receiving criticism from others does not inhibit me from expressing my thoughts.

I would guess that Silicon Valley entrepreneurs and leading technologists are typically about +2 SD on each of these items! One can directly estimate M/F ratios from these parameters ...
For example, if a typical male SV entrepreneur / tech leader is roughly +2 SD on these traits (relative to the male distribution), whereas a female at the same trait level is +2.5 SD (relative to the female distribution), the population fraction above that threshold would be 3:1 or 4:1 larger for males. This doesn't mean that the females who are > +2.5 SD (in the female population) are ill-suited to the role (they may be as good as the men), just that there are fewer of them in the general population. I was shocked to see that even top Google leadership didn't understand this point that Damore tried to make in his memo.
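To make the arithmetic concrete, here is a minimal Python sketch. It assumes the trait is normally distributed in each group with equal variance, so the only difference is a 0.5 SD shift in the mean; the function names are just for illustration.

```python
import math


def upper_tail(z: float) -> float:
    """Fraction of a standard normal population lying above z SD."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_ratio(z_group_a: float, z_group_b: float) -> float:
    """Ratio of the two groups' population fractions above a common threshold.

    The threshold sits z_group_a SD above group A's mean and
    z_group_b SD above group B's mean.
    """
    return upper_tail(z_group_a) / upper_tail(z_group_b)


if __name__ == "__main__":
    # Threshold is +2 SD in the male distribution, +2.5 SD in the female one.
    print(f"fraction above +2.0 SD: {upper_tail(2.0):.4f}")
    print(f"fraction above +2.5 SD: {upper_tail(2.5):.4f}")
    print(f"ratio: {tail_ratio(2.0, 2.5):.2f} to 1")
```

Under these assumptions the ratio comes out at roughly 3.7 to 1, consistent with the 3:1 or 4:1 estimate. The ratio grows quickly as the threshold moves further into the tail, which is why small mean differences matter most at the extremes.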

A 6ft3 Asian-American guard (Jeremy Lin) might be just as good as other guards in the NBA, but the fraction of Asian-American males who are 6ft3 is smaller than for other groups, like African-Americans. Even if there were no discrimination against Asian players, you'd expect to see fewer (relative to base population) in the NBA due to the average height difference.

Behold the Brogrammer: James Damore (Bloomberg video)

Watch a few minutes of this Bloomberg interview and I think you'll agree he's both sincere and well-meaning, if a bit naive about the buzzsaw he has stepped into. Definitely not a brogrammer.

He reminds me of Richard Hendricks of the HBO show Silicon Valley.

See also Damore vs Google: Trial of the Century? and In the matter of James Damore, ex-Googler

Damore vs Google: Trial of the Century?

In his memo, James Damore asserts that Google is engaged in illegal discriminatory practices as part of its efforts to increase diversity. (See earlier post, In the matter of James Damore, ex-Googler.)

The image below is from the actual memo. Does Damore sound like a sexist brogrammer Neanderthal?

OKRs = Objectives and Key Results. Damore is pointing out that pro-diversity objectives may incentivize managers to discriminate by gender or race in hiring and promotion.

According to Margot Cleveland (attorney who teaches labor law at Notre Dame):
The Federalist: ... Damore wrote “Google has created several discriminatory practices.” This reads of a classic case of opposition to an unlawful employment practice. (Under the case law, the practice need not actually be illegal if the employee reasonably believed it discriminatory.)

This passage may well be Google’s undoing. Damore can present a prima facie case of illegal retaliation: he engaged in protected activity by opposing several discriminatory practices, and was fired from his job. The close temporal nexus creates an inference that Google fired him because of his opposition to illegal discrimination.

... Google will counter that it fired him not because of his opposition but because of the gender stereotypes he included in the memo.
But of course the Google Brain was simultaneously using these "stereotypes" = correlations as its core revenue driver:

Professor Cleveland concludes:
... Once before a jury, Google will be hard-pressed to justify Damore’s firing because the jury will be force-fed the actual words Damore wrote, not the press’ hysterical gloss. In this regard, Google was in a no-win situation: Once the Neanderthal narrative formed, Google had no real choice but to fire Damore—which doesn’t make it right or, as Google is likely to find out soon, legal. In the meantime, the rest of the country will be treated to a nice civics refresher course and a deep-dive into federal employment and labor law.
Not to mention a deep-dive into the science of statistical / distributional group differences!

Bloomberg video interview with Damore.

Tuesday, August 08, 2017

In the matter of James Damore, ex-Googler

James Damore, Harvard PhD* in Systems Biology, and (until last week) an engineer at Google, was fired for writing this memo: Google’s Ideological Echo Chamber, which dares to display the figure above.

Here is Damore's brief summary of his memo (which contains many citations to original scientific research), and the conclusion:
Google’s political bias has equated the freedom from offense with psychological safety, but shaming into silence is the antithesis of psychological safety.
● This silencing has created an ideological echo chamber where some ideas are too sacred to be honestly discussed.
● The lack of discussion fosters the most extreme and authoritarian elements of this ideology.
○ Extreme: all disparities in representation are due to oppression
○ Authoritarian: we should discriminate to correct for this oppression
● Differences in distributions of traits between men and women may in part explain why we don't have 50% representation of women in tech and leadership.
● Discrimination to reach equal representation is unfair, divisive, and bad for business.

I hope it’s clear that I’m not saying that diversity is bad, that Google or society is 100% fair, that we shouldn’t try to correct for existing biases, or that minorities have the same experience of those in the majority. My larger point is that we have an intolerance for ideas and evidence that don’t fit a certain ideology. I’m also not saying that we should restrict people to certain gender roles; I’m advocating for quite the opposite: treat people as individuals, not as just another member of their group (tribalism).
This actual excerpt is of course very different from the heavily biased (mendacious) characterizations of the memo in the (lying) media. Perhaps that should make you wonder about the reliability of mainstream accounts concerning this matter.

Damore correctly anticipated his own demise! CEO Sundar Pichai's company-wide message seems to ban almost all scientific discussion of statistical or distributional group differences, on threat of termination:
This has been a very difficult time. I wanted to provide an update on the memo that was circulated over this past week.

First, let me say that we strongly support the right of Googlers to express themselves, and much of what was in that memo is fair to debate, regardless of whether a vast majority of Googlers disagree with it. However, portions of the memo violate our Code of Conduct and cross the line by advancing harmful gender stereotypes in our workplace. Our job is to build great products for users that make a difference in their lives. To suggest a group of our colleagues have traits that make them less biologically suited to that work is offensive and not OK. It is contrary to our basic values and our Code of Conduct, which expects “each Googler to do their utmost to create a workplace culture that is free of harassment, intimidation, bias and unlawful discrimination.”

The memo has clearly impacted our co-workers, some of whom are hurting and feel judged based on their gender. Our co-workers shouldn’t have to worry that each time they open their mouths to speak in a meeting, they have to prove that they are not like the memo states, being “agreeable” rather than “assertive,” showing a “lower stress tolerance,” or being “neurotic.”

At the same time, there are co-workers who are questioning whether they can safely express their views in the workplace (especially those with a minority viewpoint). They too feel under threat, and that is also not OK. People must feel free to express dissent. So to be clear again, many points raised in the memo—such as the portions criticizing Google’s trainings, questioning the role of ideology in the workplace, and debating whether programs for women and underserved groups are sufficiently open to all—are important topics. The author had a right to express their views on those topics—we encourage an environment in which people can do this and it remains our policy to not take action against anyone for prompting these discussions. ...
Larry Summers was forced out of the Harvard presidency (at least in part) for pointing out (correctly, it seems) that males exhibit higher variance in performance on cognitive tests (more very low- and high-scoring men than women per capita). His detractors justified his removal by pointing to his highly public and symbolic role as the leader of the institution. In contrast, Damore was simply an engineer (with a background in computational biology) expressing his opinion on some basic scientific questions still under active investigation by researchers all over the world. His firing has to be regarded as scary authoritarian policing of thought.

See also Bounded Cognition, Gender differences in preferences, choices, and outcomes, 2:1 faculty preference for women on STEM tenure track (PNAS), and Gender trouble in the valley.

A literature review at Slate Star Codex.

If I worked at Google would this blog post get me fired?

Note Added:

* Damore may be ABD (left Harvard before completing his dissertation) rather than a PhD.

Damore is going to fight Google in court (NYTimes):
Mr. Damore, who worked on infrastructure for Google’s search product, said he believed that the company’s actions were illegal and that he would “likely be pursuing legal action.”

“I have a legal right to express my concerns about the terms and conditions of my working environment and to bring up potentially illegal behavior, which is what my document does,” Mr. Damore said.

Before being fired, Mr. Damore said, he had submitted a complaint to the National Labor Relations Board claiming that Google’s upper management was “misrepresenting and shaming me in order to silence my complaints.” He added that it was “illegal to retaliate” against an N.L.R.B. charge.
According to The Federalist, Damore has a case. This trial of the century might expose large numbers of people to the ideas in his memo...

Bloomberg video interview with Damore.

Monday, July 31, 2017

Robots taking our jobs

The figures below are from the recent paper Robots and Jobs: Evidence from US Labor Markets, by Acemoglu and Restrepo.

VoxEU discussion:
... Estimates suggest that an extra robot per 1000 workers reduces the employment to population ratio by 0.18-0.34 percentage points and wages by 0.25-0.5%. This effect is distinct from the impacts of imports, the decline of routine jobs, offshoring, other types of IT capital, or the total capital stock.
If each robot does the work of a few workers, that explains the fraction-of-a-percent negative effect on employment and compensation in a model with direct substitution of robot labor for human work, plus a smaller second-order positive effect from the comparative advantage of humans in complementary jobs. This is not the optimistic scenario where buggy whip makers displaced by the automobile easily find good new jobs in the expanded economy. We can expect to see many more robots (and virtual AI robots) per 1000 workers in the near future.
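A back-of-the-envelope check of the "few workers per robot" reading of the VoxEU numbers, as a minimal Python sketch. The baseline employment-to-population ratio of 0.6 is my assumption for illustration, not a figure from the excerpt above, and the function name is hypothetical.

```python
def workers_displaced_per_robot(effect_pp: float, emp_pop_ratio: float) -> float:
    """Implied number of jobs lost per robot.

    effect_pp: drop in the employment-to-population ratio, in percentage
        points, for one extra robot per 1000 workers.
    emp_pop_ratio: baseline employment-to-population ratio (ASSUMED value).
    """
    # One robot per 1000 workers, re-expressed per 1000 people.
    robots_per_1000_people = 1.0 * emp_pop_ratio
    # A drop of effect_pp percentage points, re-expressed per 1000 people.
    jobs_lost_per_1000_people = effect_pp / 100.0 * 1000.0
    return jobs_lost_per_1000_people / robots_per_1000_people


if __name__ == "__main__":
    ASSUMED_EMP_POP_RATIO = 0.6  # assumption, not from the source
    for effect_pp in (0.18, 0.34):  # range quoted in the VoxEU summary
        n = workers_displaced_per_robot(effect_pp, ASSUMED_EMP_POP_RATIO)
        print(f"{effect_pp} pp -> about {n:.1f} workers per robot")
```

With that assumed baseline, the quoted range works out to roughly three to six workers per robot. A different baseline ratio shifts the numbers but not the order of magnitude.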

Related talk at HKUST by Harvard labor economist Richard Freeman: Work and Income in the Age of Robots and AI. This time it's different?

Here's Richard in 2011 when we were working on a project at Alibaba headquarters :-)

China's rise in Science and Engineering indicators (NSF)

Data from the 2016 NSF report on global Science & Engineering Indicators shows the rapid rise of China in both academic science and applied technology.

Rapid growth in number of Chinese S&E articles, reaching parity with US in 2013, and well ahead of Japan and India.

Fraction of high impact (top 1% most cited) papers highest for US research (~1.9%). China and Japan comparable at ~0.8% as of 2012. China's fraction roughly doubled between 2001 and 2012.

As of today, the total number of high impact papers is still probably ~2:1 in favor of the US. But I think most people would be surprised to see that China has caught up with (surpassed?) Japan in this quality metric.

US and China now each account for ~30% of global high tech value-added manufacturing. Value-added means net of input components -- going beyond simple assembly.

Sunday, July 30, 2017

Like little monkeys: How the brain does face recognition

This is a Caltech TEDx talk from 2013, in which Doris Tsao discusses her work on the neuroscience of human face recognition. Recently I blogged about her breakthrough in identifying the face recognition algorithm used by monkey (and presumably human) brains. The algorithm seems similar to those used in machine face recognition: individual neurons perform feature detection just as in neural nets. This is not surprising from a purely information-theoretic perspective, if we just think about the space of facial variation and the optimal encoding. But it is amazing to be able to demonstrate it by monitoring specific neurons in a monkey brain.

An earlier research claim, that certain neurons are sensitive only to specific faces, seems not to be true. (She recapitulates the claim at about 8:50 in the video, which was recorded four years ago.) I always found it implausible.

On her faculty web page Tsao talks about her decision to attend Caltech as an undergraduate:
One day, my father went on a trip to California and took a tour of Caltech with a friend. He came back and told me about a monastery for science, located under the mountains amidst flowers and orange trees, where all the students looked very skinny and super smart, like little monkeys. I was intrigued. I went to a presentation about Caltech by a visiting admissions officer, who showed slides of students taking tests under olive trees, swimming in the Pacific, huddled in a dorm room working on a problem set... I decided: this is where I want to go to college! I dreamed every day about being accepted to Caltech. After I got my acceptance letter, I began to worry that I would fall behind in the first year, since I had heard about how hard the course load is. So I went to the library and started reading the Feynman Lectures. This was another world…where one could see beneath the surface of things, ask why, why, why, why? And the results of one’s mental deliberations actually could be tested by experiments and reveal completely unexpected yet real phenomena, like magnetism as a consequence of the invariance of the speed of light.
See also Feynman Lectures: Epilogue and Where Men are Men, and Giants Walk the Earth.

Thursday, July 27, 2017

First Human Embryos Edited in U.S. (MIT Technology Review)

It's only a matter of time... Note this kind of work can be done very secretly and with very modest resources -- it does not require banks of centrifuges, big reactors, or ICBM test launches.
First Human Embryos Edited in U.S. (MIT Technology Review)

Researchers have demonstrated they can efficiently improve the DNA of human embryos.

The first known attempt at creating genetically modified human embryos in the United States has been carried out by a team of researchers in Portland, Oregon, MIT Technology Review has learned.

The effort, led by Shoukhrat Mitalipov of Oregon Health and Science University, involved changing the DNA of a large number of one-cell embryos with the gene-editing technique CRISPR, according to people familiar with the scientific results.

Until now, American scientists have watched with a combination of awe, envy, and some alarm as scientists elsewhere were first to explore the controversial practice. To date, three previous reports of editing human embryos were all published by scientists in China.

Now Mitalipov is believed to have broken new ground both in the number of embryos experimented upon and by demonstrating that it is possible to safely and efficiently correct defective genes that cause inherited diseases.

... Some critics say germline experiments could open the floodgates to a brave new world of “designer babies” engineered with genetic enhancements—a prospect bitterly opposed by a range of religious organizations, civil society groups, and biotech companies.

The U.S. intelligence community last year called CRISPR a potential "weapon of mass destruction.”

... Mitalipov and his colleagues are said to have convincingly shown that it is possible to avoid both mosaicism and “off-target” effects, as the CRISPR errors are known.

A person familiar with the research says “many tens” of human IVF embryos were created for the experiment using the donated sperm of men carrying inherited disease mutations.
Work on cognitive enhancement will probably be done first in monkeys, proving Planet of the Apes prophetic within the next decade or so :-)
His team’s move into embryo editing coincides with a report by the U.S. National Academy of Sciences in February that was widely seen as providing a green light for lab research on germline modification.

The report also offered qualified support for the use of CRISPR for making gene-edited babies, but only if it were deployed for the elimination of serious diseases.

The advisory committee drew a red line at genetic enhancements—like higher intelligence. “Genome editing to enhance traits or abilities beyond ordinary health raises concerns about whether the benefits can outweigh the risks, and about fairness if available only to some people,” said Alta Charo, co-chair of the NAS’s study committee and professor of law and bioethics at the University of Wisconsin–Madison.

In the U.S., any effort to turn an edited IVF embryo into a baby has been blocked by Congress, which added language to the Department of Health and Human Services funding bill forbidding it from approving clinical trials of the concept.

Despite such barriers, the creation of a gene-edited person could be attempted at any moment, including by IVF clinics operating facilities in countries where there are no such legal restrictions.

Tuesday, July 25, 2017

Natural Selection and Body Shape in Eurasia

Prior to the modern era of genomics, it was claimed (without good evidence) that divergences between isolated human populations were almost entirely due to founder effects or genetic drift, and not due to differential selection caused by disparate local conditions. There is strong evidence now against this claim. Many of the differences between modern populations arose over relatively short timescales (e.g., ~10ky), due to natural selection.
Polygenic Adaptation has Impacted Multiple Anthropometric Traits

Jeremy J Berg, Xinjun Zhang, Graham Coop
doi: https://doi.org/10.1101/167551

Most of our understanding of the genetic basis of human adaptation is biased toward loci of large phenotypic effect. Genome wide association studies (GWAS) now enable the study of genetic adaptation in highly polygenic phenotypes. Here we test for polygenic adaptation among 187 worldwide human populations using polygenic scores constructed from GWAS of 34 complex traits. By comparing these polygenic scores to a null distribution under genetic drift, we identify strong signals of selection for a suite of anthropometric traits including height, infant head circumference (IHC), hip circumference (HIP) and waist-to-hip ratio (WHR), as well as type 2 diabetes (T2D). In addition to the known north-south gradient of polygenic height scores within Europe, we find that natural selection has contributed to a gradient of decreasing polygenic height scores from West to East across Eurasia, and that this gradient is consistent with selection on height in ancient populations who have contributed ancestry broadly across Eurasia. We find that the signal of selection on HIP can largely be explained as a correlated response to selection on height. However, our signals in IHC and WC/WHR cannot, suggesting a response to selection along multiple axes of body shape variation. Our observation that IHC, WC, and WHR polygenic scores follow a strong latitudinal cline in Western Eurasia support the role of natural selection in establishing Bergmann's Rule in humans, and are consistent with thermoregulatory adaptation in response to latitudinal temperature variation.
From the paper:
... To explore whether patterns observed in the polygenic scores were caused by natural selection, we tested whether the observed distribution of polygenic scores across populations could plausibly have been generated under a neutral model of genetic drift ...



The study of polygenic adaptation provides new avenues for the study of human evolution, and promises a synthesis of physical anthropology and human genetics. Here, we provide the first population genetic evidence for selected divergence in height polygenic scores among Asian populations. We also provide evidence of selected divergence in IHC and WHR polygenic scores within Europe and to a lesser extent Asia, and show that both hip and waist circumference have likely been influenced by correlated selection on height and waist-hip ratio. Finally, signals of divergence among Asian populations can be explained in terms of differential relatedness to Europeans, which suggests that much of the divergence we detect predates the major demographic events in the history of modern Eurasian populations, and represents differential inheritance from ancient populations which had already diverged at the time of admixture. ...
