Over the past century, the labor market in America has seen a dramatic shift from blue-collar to white-collar work. According to Census Bureau data, white-collar work in the US grew from 17.6% of total employment in 1900 to 59.9% in 2003. The “white-collar” categorization was then discontinued for lack of specificity, but the fact remains that the majority of the American workforce is now employed in work where brainpower is more relevant than muscle power, the opposite of the situation a century ago.
As knowledge work was displacing manual labor as the most common form of employment in America, the amount of formal education completed by Americans also grew rapidly. Before World War 2, only a quarter of American adults had graduated high school, and a vanishingly small percent had gone to college. Americans like George Washington, Cornelius Vanderbilt and Thomas Edison rose to the heights of politics, business and science with little to no formal education. But that state of affairs changed quickly after the end of World War 2 with the introduction of the GI bill, which paid for postsecondary education for millions of veterans. 75 years later, a large majority of American adults now hold high school degrees, and over a third hold bachelor’s degrees.
As Americans have begun to attain higher levels of formal educational credentials, so too have jobs begun to demand higher levels of credentials as a prerequisite. Part of the reason is that some jobs really do require extensive formal education: patients might not want to be operated on by a neurosurgeon who had learned exclusively on-the-job! But many white-collar jobs, like sales and clerical roles, which do not require extensive theoretical training and traditionally were not filled by college graduates, are starting to require higher degrees in a phenomenon often referred to as degree inflation.
Some of this is rational from the point of view of the employer, as college graduates can be expected to have higher levels of skills applicable to office work than high school graduates or dropouts, if only due to their four extra years of experience performing a type of knowledge work in college. But spending four years writing papers on art history with the goal of landing a sales job is not efficient from the point of view of either the prospective employer, who limits their labor pool and ends up paying a premium for entry-level labor, or the prospective employee, who incurs often-heavy tuition fees and four years of opportunity cost before starting a career.
A better model exists in the form of apprenticeships. For centuries, teenagers have learned skilled blue-collar trades in collaboration with more experienced mentors, and have emerged into young adulthood as full-fledged professionals in their chosen field. As an added benefit, apprentices can be productive in the lower-skilled parts of a job almost from the get-go, implicitly paying for their own training and earning a wage as their counterparts in college instead pay tuition.
This model has occasionally been applied to white-collar professions in countries like Germany and the UK, in which formal apprenticeships in the traditional mold exist for higher-skilled jobs like IT system administration and CNC machining. Those programs, where they exist, are generally coordinated to some degree by the relevant government, but America’s culture would likely pair better with a privatized apprenticeship model, which could fill niches unseen by a government administrator.
In such a model, companies large enough to sustain internal apprenticeship programs would designate appropriate roles for which they would hire apprentices, like sales or web development. For those roles, they would hire recent high school graduates or even dropouts, who would commit to the apprenticeship program for several years.
The apprentices would be assigned to a team and mentor, just like a typical intern or co-op student worker, with the key difference being that they would stay on each team for one to several years, rotating as appropriate to learn different aspects of their chosen profession.
This on-the-job training would be paired with classroom training, where each cohort of apprentices would be instructed in relevant skills alongside their day job. This might include things like written and oral communication for sales apprentices, psychology and anthropology for marketers, and computer science and graphic design for web developers. Some of these courses could be conducted internally by the company’s more-senior employees; others could be outsourced to local colleges in a reverse co-op arrangement.
The expectation would be that after completion of the apprenticeship program, some of its alumni would keep working in their new profession, while others would then go to college to pursue a broader and deeper formal education. Those that kept working would have the advantage of a four-year head start in their career trajectory and a much better financial situation than a recent college graduate; those that went to college after all would have a very compelling college admissions packet, transfer credits to the extent their employer could negotiate for them with universities, and a much better idea of what is worth studying in college than someone right out of high school. Talented students from poor families and poor schools might benefit to an especial degree from this sort of program, which would let them gain a stable financial footing and a better understanding than that provided by their high school of how to navigate a future college education. In this way, an apprenticeship program would play a similar role to that which the military plays for many teenagers today, but would prepare them for a rather different sort of work.
Of course, some apprentices would probably flounder partway through the program, or conversely be tempted to leave for a competitor. Employers could protect their investment by a similar mechanism to what West Point does for cadets: apprentices could leave at any point during the first year with no strings attached, but those that left later on would have to pay back the expenses incurred in training them. The apprenticeship program could also have a contractual requirement to work for the employer for a certain number of years after completing the program, with the alternative option of paying a financial penalty. Employers could also protect their investments in their apprentices by paying the college tuition of apprenticeship alumni who wished to go to college afterwards, under the condition that they return for a certain number of years after completing their degrees – much like many employers pay for business school for their employees under the condition that they come back afterwards.
This system of white-collar apprenticeships would have significant advantages for both the employer and the apprentice.
The employer would be able to attract some of the most talented and driven teenagers with a unique value proposition and thereby gain a recruiting advantage over competitors that wait to hire heavily contested college graduates. The apprentices, once recruited, would also be valuable because they could perform necessary but less-skilled work that must currently be done by older employees for whom it is tremendously boring. Once done with their contractual term, a number of apprentices could be expected to stay with the employer and keep delivering value for years to come, assuming the employer kept offering opportunities for advancement and a good work culture.
For high school students, the apprenticeship program would represent a unique opportunity to learn a profession with great career opportunities and earn a living straight out of high school, while keeping options open for a college education and even improving them. Right now, many high school graduates go off to college to study things which they will never use again, and make friends there with people who all too often end up moving to different cities and drifting apart over time. Instead, a well-run apprenticeship program would let students learn a gainful profession and the theoretical knowledge behind it while earning money from day one, and use some of the most social years of their lives to form bonds with friends who would be far more likely to stay in the same industry and city and remain close social and professional contacts for decades.
Starting a program like this would surely lead to howls of disapproval from those who see the one-size-fits-all track of formal education as the right way for everyone, but the company that started it would be more than compensated for that negative attention by the newfound stream of talent it would be able to access. Done right, an apprenticeship program would prove its worth in a matter of years, and would surely spawn numerous competitors – exactly what our economy needs in this era of degree inflation.
The only question is, which company will seize the opportunity to go first? 
“Quantity has a quality of its own” –Attribution contested
A significant factor in the power of states throughout history has been sheer numbers. Spain was able to control an overseas empire to a significantly greater degree than Portugal thanks in no small part to its larger population. England subsequently rose in power and grew to control an empire of its own aided by a population boom in the 19th century. And the United States supplanted the United Kingdom as the world’s great power as its population vastly outgrew that of its former colonial overlord.
Of course, many factors other than population influence the power of states, including economic productivity and the strength of institutions. But population is a multiplier for those factors, and a country with a large enough population can exercise comparable or greater power than more developed but less populous countries. For a long time, this was the role that Russia played in Europe: less advanced technologically and economically than much of Western Europe, but a great power by sheer force of numbers.
The most important story in geopolitics today is the rise of China (population 1.3B) and its challenge to the supremacy of the United States (population 300M). The rise of China as a peer superpower to the United States is all but accomplished, and it appears that we are returning to a bipolar world, this time characterized by competition between the US and China, just as 1946-1991 was characterized by competition between the US and USSR.
However, if China succeeds in sustaining its growth trajectory economically and militarily, it will grow to overshadow the United States, wielding the same ~5x population advantage over the US that the US now enjoys over its predecessor power, the United Kingdom. This also means that the repressive and authoritarian Chinese model will increasingly prevail over the free democratic model that America has championed.
However, America is not the most populous democracy in the world. That honor belongs to India. India is forecast to surpass China in population in the next decade, and in the next few decades to grow almost 50% more populous than China. India currently punches below its weight on the world stage due to slow economic development: 30 years ago, its GDP per capita was similar to China’s, but it is now roughly one-fifth of China’s. However, if India were to enter a period of growth over the next 30 years as high as China’s has been for the past 30, it would quickly become one of the most powerful countries in the world thanks to the scaling factor of its immense population. Moreover, India’s population is forecast to continue growing quickly, while China’s is forecast to shrink, and that will only compound any advantages that India accumulates.
In a world that is quickly going from unipolar to multipolar, it is worth considering which states will wield influence in the century to come, and on behalf of which values (if any, other than self-interest!) they will wield it. If China rises to heights of power that eclipse the US completely, only India may be strong enough to speak for liberal democracy. It is therefore in the interest of the US and other Western powers to develop closer ties to India and encourage its economic development, so as to nurture a counterweight to the rise of authoritarianism in the 21st century.
Note: This post sparked a great conversation on Hacker News! See the comments here.
You’ve probably seen some pretty outrageous claims about what artificial intelligence (AI) can do. Maybe it’s been media headlines about how AI will soon be able to do everything humans can and will put us all out of work. Maybe it’s been company press releases about how their revolutionary AI technology will cure cancer or send rockets to the moon.
Those claims, and many others like them, are largely or entirely untrue, and are designed to draw attention rather than to reflect what is actually possible or likely to be possible soon. But AI really has made incredible advances in the last decade, and will continue to drive tremendous changes across our economy and society in the years to come.
Given that rapid progress, it’s important to be able to tell the outrageous claims about AI from those that are more credible, but how? Part of the reason I went to Stanford for grad school was to learn to do just that, and after doing quite a bit of AI research in different subfields, I’ve gained an intuition for what modern AI methods can really do. And by spending time in the world of startups, I’ve combined the more theoretical knowledge from the lab with a high-level understanding of how the advancements in research are being applied to the real world. I’ve found myself in many conversations with people wondering about exactly this question: which capabilities is AI quickly developing, and where will they be applicable? This is my attempt to distill those conversations into written form, and to spread a better understanding of the capabilities of AI in today’s world.
I’ve written this essay to be approachable by those without a background in computer science or statistics, but you’ll probably get the most out of it if you have a quantitative background. You’ll probably find it especially interesting if you’re in a STEM field and are wondering how AI is likely to change the way things are done in your world, or are in business and entrepreneurship and are wondering which opportunities AI is now opening up, and which older ways of doing things it’s threatening to make obsolete!
Update: Some readers have noted that this essay largely omits AI techniques other than deep learning. That’s true! Earlier waves of AI research are important both in a historical sense and because they’ve yielded many approaches that are still relevant to this day. Even after the advent of deep learning, many real-world problems are better solved by battle-tested techniques like logistic regression, decision trees, or any of a number of others. However, those techniques worked just as well ten years ago as they do today. The aim of this essay is to outline what is newly possible or becoming possible thanks to the current deep learning era of AI, and to draw a distinction between those real possibilities and problems that are either intractable or solvable without the need for deep learning techniques.
The current wave of AI
In the decades following the emergence of computers, the term AI has shifted meaning repeatedly. Over and over, computers have gained new capabilities – like winning at chess, or answering simple questions – thanks to the application of new techniques, or just greater availability of processing power. When those capabilities appear novel enough, they often generate a wave of excitement around AI, and the public conversation around AI becomes centered on those new capabilities, largely to the exclusion of previously novel but now-mundane techniques that had generated previous waves of excitement.
The current wave of excitement centers on a set of techniques known as deep learning, which have unlocked unprecedented performance in a wide range of real-world applications. Deep learning rests on a surprisingly simple technique: if you stack many very simple functions (a couple of examples in one dimension are y = 2x, or y = tanh(x)) one after the other, you can nudge, or train, the resulting many-layered function, or neural network, to map complex inputs to outputs with surprising accuracy over time. For example, if you want to tell pictures of cats apart from pictures of dogs, you might train your network to output 1s for dogs and 0s for cats. To conduct one step of training, you might input a photo of a dog into the network, and then if the output was incorrectly closer to 0 than to 1, nudge the layer functions of the network in a direction that will make the final output of the network slightly larger. For example, assuming a positive input, this might mean nudging a function that is currently y = 2x to be y = 2.1x. With some tuning, repeating this process thousands or even millions of times is likely to yield a network that can tell cats from dogs with surprisingly high accuracy.
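To make the "stack and nudge" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: each "photo" is reduced to a single invented number, the network is just two stacked functions (a tanh layer followed by a sigmoid layer), and the nudging rule is ordinary gradient descent on squared error. Real networks have millions of adjustable numbers, but the training loop has the same shape.

```python
import math

def forward(x, w1, w2):
    """Two stacked simple functions: h = tanh(w1 * x), then y = sigmoid(w2 * h)."""
    h = math.tanh(w1 * x)
    y = 1.0 / (1.0 + math.exp(-w2 * h))
    return h, y

# Toy data, invented for illustration: one made-up feature per "photo".
# Label 1 = dog, label 0 = cat.
data = [(1.5, 1), (0.8, 1), (2.0, 1), (-1.2, 0), (-0.7, 0), (-2.1, 0)]

w1, w2 = 0.1, 0.1        # arbitrary starting values for the two adjustable numbers
learning_rate = 0.5      # how big each nudge is

for epoch in range(1000):
    for x, label in data:
        h, y = forward(x, w1, w2)
        # Chain rule for the error 0.5 * (y - label)**2, working backwards
        # through the sigmoid layer and then the tanh layer.
        d_z = (y - label) * y * (1 - y)
        grad_w2 = d_z * h
        grad_w1 = d_z * w2 * (1 - h * h) * x
        # Nudge each number slightly in the direction that shrinks the error.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2

predictions = [round(forward(x, w1, w2)[1]) for x, _ in data]
print(predictions)
```

After training, the outputs for the "dog" inputs sit close to 1 and those for the "cat" inputs close to 0, even though no one ever wrote a rule saying which is which.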
Of course, not all techniques currently referred to as AI are based on this deep learning approach, and the general excitement around AI has created plenty of nonsense — the joke goes that many startups tell the public that they’re an AI company, tell investors that they use machine learning, and internally use logistic regression (a simple and decades-old statistical technique) for some predictive task that may not even be core to their business model. A fictional example of this might be an online used-clothing marketplace, probably called something like mylooks.ai, that claims to be an AI company but in reality does nothing more than using simple statistics to predict when to send marketing emails to users to drive the most engagement.
But don’t be fooled by the hype: deep learning really has changed the game in terms of what AI can do in the real world. Computational techniques before deep learning were very good at working with structured data (tables, databases, etc.), but much less good at unstructured data (images, video, audio, text, etc.), which is often very important in the real world. Deep learning, unlike older approaches, is very good at dealing with unstructured data, and that is where its power lies. Tasks that were previously hard or impossible to do reliably, like image identification, have suddenly become almost trivial, unlocking many applications. This comic (https://xkcd.com/1425/), from less than a decade ago, is already out of date: identifying a bird in a photo is now straightforward!
All of these new abilities do come with a caveat: deep learning performance depends on huge amounts of data and computational power. The techniques of deep learning themselves have existed for decades, with limited applications like reading handwritten addresses for the postal service. But the limited availability of data and computational power hampered their performance until the 2010s, when two things happened. One was that the maturation of the Internet made vast amounts of text, images, and other unstructured data available. The other was the increasing performance of GPU chips, which were originally designed for gaming (I remember installing them in my gaming PC growing up!) but which, through a lucky accident, turned out to be incredibly useful for the acceleration of deep learning algorithms. When those two factors came together, deep learning made sudden and large gains in performance, which started drawing significant attention in 2012 when the AlexNet program smashed records on an image recognition challenge.
The resulting attention drew in huge numbers of researchers and engineers in both academia and industry, and there has since been incredible progress in both fundamental AI research and downstream applications. Unfortunately, the attention has also created tremendous amounts of unwarranted hype, especially in industry but even in academia. The stakes are high to be able to tell one from the other, whether you’re an engineer deciding whether to work at a company that claims to be developing commercially-relevant AI, a policymaker forecasting changes in employment numbers, or an entrepreneur trying to tell a real opportunity from a mirage.
So, what’s the best way to tell the real AI applications from the fake ones? The best strategy is to keep a finger on the pulse of what the research community is doing. With few exceptions, ideas that are deployed in industry are first published in the research literature, and then deployed (with modifications when needed) to a similar or analogous real-world task. Thus, an idea for an AI application that’s adjacent to something that’s already been published as a research finding is likely to be worth investigating, but one that’s totally unrelated to anything in the literature is best viewed with skepticism.
There are significant nuances to which ideas are truly adjacent to each other – telling cats from dogs is an extremely similar task to telling defective machine parts from functional ones, but classifying a sentence as happy or sad is much easier than classifying it as insightful or inane. However, gaining a broad familiarity with what the different research communities in the world of AI are working on is a great way to gain an intuition for what is truly possible or likely to soon be possible with AI, and that’s what the rest of this essay is about.
Before we turn to a discussion of the communities in the world of AI research, it’s worth going over a few foundational ideas.
A term that’s closely associated with AI is machine learning (ML), which refers to AI approaches which learn from data, as opposed to being pre-programmed by humans. As most modern approaches to AI rely heavily or entirely on ML, the two terms have become synonymous in common usage.
An AI model is a configuration of functions which can be, or have been, trained on some task. For example, AlexNet is an image recognition model composed of many functional layers. You might train a previously untrained copy of AlexNet on millions of photos to classify them into categories, or you might use a copy of AlexNet that’s already been trained on millions of images to help you tell cats from dogs.
Deep learning techniques are applicable to a broad range of machine learning tasks, which can be roughly classified into many categories. A number of these, briefly described here, are commonly encountered, and worth knowing.
Reinforcement learning (RL) involves step-by-step decision-making by a model, e.g. the controls software for a robot which plays ping-pong. At every time step, an RL model has some information about the state of the world (e.g. the position of the paddle and the ball) and takes some action (e.g. moving the paddle to the right) based on its policy, which is the term used to denote the program that picks actions based on states. The policy is trained to maximize a reward signal, which is encountered intermittently (e.g. +1 reward for winning a point, -1 reward for losing).
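The vocabulary above (state, action, policy, intermittent reward) can be sketched with a toy "catch the ball" game. This is a hedged illustration, not how a real ping-pong robot is trained: the five-column court, the `step` function, and the choice of tabular Q-learning are all assumptions made here, and real systems typically use a neural network rather than a lookup table.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

WIDTH, HEIGHT = 5, 4      # hypothetical toy court: 5 columns, ball starts 4 rows up
ACTIONS = (-1, 0, 1)      # move paddle left, stay put, move paddle right

def step(state, action):
    """Advance one time step; return (next_state, reward, done)."""
    paddle, ball, height = state
    paddle = min(WIDTH - 1, max(0, paddle + action))
    height -= 1
    if height == 0:       # the ball lands: the only moment a reward arrives
        return None, (1 if paddle == ball else -1), True
    return (paddle, ball, height), 0, False

q = {}  # lookup table: estimated future reward for each (state, action) pair

def policy(state):
    """The policy: pick the action with the highest estimated value in this state."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

for episode in range(20000):
    state = (random.randrange(WIDTH), random.randrange(WIDTH), HEIGHT)
    done = False
    while not done:
        action = random.choice(ACTIONS)   # explore at random while learning
        next_state, reward, done = step(state, action)
        future = 0.0 if done else max(q.get((next_state, a), 0.0) for a in ACTIONS)
        # This toy world is deterministic, so the estimate can simply be overwritten.
        q[(state, action)] = reward + future
        state = next_state

def play(paddle, ball):
    """Run one game with the learned policy and return the final reward."""
    state, done = (paddle, ball, HEIGHT), False
    while not done:
        state, reward, done = step(state, policy(state))
    return reward

wins = sum(play(p, b) == 1 for p in range(WIDTH) for b in range(WIDTH))
print(wins, "wins out of", WIDTH * WIDTH, "starting positions")
```

Note that the reward only shows up when the ball lands, yet the learned policy ends up making sensible moves several steps earlier, because the value estimates propagate backwards from the end of the game.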
Supervised learning is the setting in which a machine learning model is trained to map inputs to outputs. The model is trained with a labeled training set of known (input, output) pairs, and then tasked with predicting outputs corresponding to previously unseen inputs. Supervised learning can be further categorized into classification, where outputs are categories (“Is the animal in this photo a dog or a cat?”) and regression, where outputs are continuous (“How much does the dog in this photo weigh?”). Most current applications of deep learning in the real world fall under supervised learning.
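As a rough sketch of the two flavors, the snippet below uses deliberately simple, non-deep methods so that the supervised setup itself is easy to see: a nearest-neighbor rule for classification and a least-squares line for regression. The measurements and weights are invented for illustration, and the function names are hypothetical.

```python
# Labeled training set of (input, output) pairs. All numbers are invented.
# Input: (height_cm, length_cm) of an animal; output: a category.
train_cls = [((25, 45), "cat"), ((23, 40), "cat"), ((60, 90), "dog"), ((55, 85), "dog")]

def classify(x):
    """Classification: predict the label of the closest training example."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train_cls, key=lambda pair: dist2(pair[0], x))[1]

# Input: height_cm of a dog; output: weight_kg, a continuous number.
train_reg = [(30, 8.0), (45, 15.0), (60, 27.0), (70, 33.0)]

def fit_line(pairs):
    """Regression: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pairs)
             / sum((x - mean_x) ** 2 for x, _ in pairs))
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(train_reg)

def predict_weight(height_cm):
    """Predict a weight for a dog height that was not in the training set."""
    return slope * height_cm + intercept

print(classify((58, 88)))     # a previously unseen animal
print(predict_weight(50))     # a previously unseen dog height
```

In both cases the pattern is the same: learn from known (input, output) pairs, then predict the output for an input the model has not seen. Deep learning swaps in a far more flexible function, which is what lets the input be a raw photo instead of two hand-measured numbers.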
Unsupervised learning is the setting in which there are no output labels in the training set, and the model’s task is instead to discover some structure in the input data. One common example of unsupervised learning is clustering, where the model is tasked with finding a grouping of the inputs (“Given these photos of animals, sort them into categories”).
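Clustering can be sketched in a few lines with k-means, one classic clustering method chosen here for illustration (the essay does not single out a particular algorithm). The animal measurements are invented, and the starting centers are picked by hand to keep the example deterministic.

```python
def dist2(a, b):
    """Squared distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: dist2(p, centers[i])) for p in points]

def kmeans(points, centers, iterations=10):
    """Alternate between grouping points by nearest center and re-centering each group."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # an empty group keeps its previous center
        centers = new_centers
    return assign(points, centers), centers

# Unlabeled inputs: (height_cm, length_cm) of six animals. No categories are given.
points = [(25, 45), (23, 40), (27, 48), (60, 90), (55, 85), (62, 95)]
labels, centers = kmeans(points, centers=[points[0], points[-1]])
print(labels)   # [0, 0, 0, 1, 1, 1]
```

The algorithm is never told what a cat or a dog is; it only discovers that the inputs fall into two groups. Attaching names to those groups is left to a human.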
One reason why the distinction between supervised and unsupervised learning is important is data efficiency. Hiring humans to label enough data to train a machine learning model in the supervised setting can easily cost thousands of dollars, and so a number of techniques exist to improve data efficiency and label efficiency, including:
Semi-supervised learning: Only some inputs in the training set have labels.
Weakly supervised learning: Labels in the training set are noisy or imprecise, so some of them may be incorrect.
Transfer learning: Use a model trained on data from task A to get higher performance on related task B, without needing to gather more data from task B. Often, the same more-general “task A” is reused as the starting point for many more-specific “task B”s, e.g. an ImageNet image classifier, trained on millions of photos to identify about a thousand categories of common objects, being fine-tuned on just a few thousand new images to learn a more specific task like telling cats from dogs.
Self-supervised learning (now in vogue!): Learn the structure of a domain by training a model to learn a function where both the inputs and outputs are available from unlabeled training data, e.g. predicting missing words in a paragraph of text, using the other words as input, or predicting missing pixels in an image, using the other pixels as input. The resulting model can then be used for transfer learning, e.g. training a model on arbitrary photos in the self-supervised manner, then fine-tuning on relatively few cat and dog photos to tell them apart from each other. This is a very useful technique as in many domains, unlabeled data is plentifully available from sources like Wikipedia and YouTube.
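The self-supervised trick of getting both inputs and outputs out of unlabeled data is easy to show for text. The sketch below only builds the training pairs for the "predict the missing word" task; it does not train a model. The function name and the `[MASK]` placeholder are assumptions made for illustration.

```python
def masked_word_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (input, target) training pairs.

    Each pair hides a single word: the input is the sentence with that word
    replaced by a placeholder, and the target is the hidden word.
    """
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

examples = masked_word_examples("the dog chased the cat")
for model_input, target in examples:
    print(repr(model_input), "->", repr(target))
```

One five-word sentence yields five labeled examples, such as `("the [MASK] chased the cat", "dog")`, without any human labeling. Applied to a large body of text, the same procedure produces an enormous supervised training set essentially for free.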
Labels aside, the availability of the training data itself is the most important factor in the performance of deep learning on tasks which are amenable to it. For example, even if you know that deep learning techniques are great at telling images of objects apart from each other, and want to train a model to tell Martian rocks from Earth rocks, deep learning won’t do you any good unless you have a lot of photos of both! In academia, research into new deep learning methods is largely conducted on publicly available datasets, a constantly evolving set of which is in broad use by the community. Achieving state-of-the-art performance on one of those commonly used data sets is a sought-after achievement in academic research. In industry, on the other hand, ownership of a hard-to-gather dataset can be a key competitive advantage for a company whose technology is powered by AI, as the most effective deep learning algorithms are largely public knowledge but the right data to train them for a specific commercially valuable task can be very hard to source.
The other key factor that dictates the performance of deep learning is the availability of computational power, often referred to as “compute” for short. Given a large enough training set of data, throwing more compute at the training process for a model will typically improve performance substantially. Indeed, achieving state-of-the-art (SOTA) results in some domains now costs hundreds of thousands of dollars in compute bills alone, and those numbers are only growing with time. A dynamic that this sometimes creates is that labs in industry, with their big budgets, will train huge models at great cost to achieve a SOTA result. Meanwhile, academic labs, with their more-modest resources, spend more of their effort on coming up with novel techniques that can drive higher and sometimes SOTA performance with less compute. A happy consequence of this second strain of research is that while achieving SOTA results keeps taking more and more compute and money, achieving the same level of performance on just about any task takes less and less compute with every passing year as the algorithms become more efficient.
Areas of AI research and their applications
Now that we’ve covered the broad categories and principles of machine learning, it’s time to dive into the most prominent areas of AI research. Application areas and types of machine learning intermingle freely: for example, ML for robotics may include both supervised learning for image recognition and reinforcement learning for controlling the robot.
In practice, deep learning has unlocked huge gains in performance in some very specific areas, and knowing what these are is very useful for gauging which applications are likely to be fruitful. Something that is closely related to work in these areas is likely to be achievable with a bit of research and development (R&D), while something totally unconnected is much more of a long shot in the near term.
Computer vision (CV) is the subfield of AI that deals with images, videos, and other related types of data like medical imaging. CV was the first field of AI to be revolutionized by the rise of deep learning, and it remains an extremely active area of research and applications. CV is also the most mature area of deep learning applications, and its high performance is well-understood and applicable to a number of tasks.
Deep learning has been so successful in CV applications for a number of reasons. One is that visual data is unstructured, and deep learning is much better at handling unstructured data than earlier approaches. In addition, deep learning – with its hunger for data – has thrived thanks to the newfound plentitude of visual data. From images on Google Images to videos on YouTube, the Internet is now full of visual content sourced mostly from now-ubiquitous smartphones.
Thanks to these factors, CV techniques powered by deep learning have been getting better and better at dealing with visual data. This is extremely important for downstream applications because visual data is able to capture a great deal of information about the physical world. Think of your own senses: while all are useful, sight is a particularly high-density way of gaining information about the world, from recognizing the objects around you to reading a book. Similarly, computers can now “see”, thanks to the advent of effective computer vision. This has unlocked all sorts of applications.
The most straightforward of those application areas is in tasks based on image classification, which consists of sorting pictures into categories, e.g. cats vs. dogs. The research into image classification has advanced so far that computers often exceed human performance. This has unlocked all sorts of downstream applications: given the right data, you can identify people’s faces to grant them access to a building, identify the items in a retail customer’s shopping basket to charge them automatically as they leave the store without the need for a checkout lane, or automatically identify defective parts in a factory.
Many more complex computer vision tasks exist as well. Two of common interest are object detection, which involves drawing a box around where certain objects are located in an image, and object segmentation, which involves precisely outlining the objects. Object detection has applications like identifying pedestrians in the field of view of an autonomous car’s camera(s). Object segmentation is useful for tasks like finding tumors in radiology images. If you want to locate objects in photos, that is now a very approachable problem.
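The difference between the two outputs can be seen in miniature below. The tiny grid stands in for an image and is invented for illustration: a segmentation result marks exactly which pixels belong to the object, while a detection result is just the rectangle around them. The `bounding_box` helper is hypothetical and only shows how the coarser output relates to the finer one; it is not how detection models work internally.

```python
# A hypothetical 5x6 segmentation mask: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, around the pixels set to 1."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]

box = bounding_box(mask)
object_pixels = sum(map(sum, mask))
box_pixels = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

print(box)             # (1, 1, 3, 4)
print(object_pixels)   # 8 pixels in the precise outline
print(box_pixels)      # 12 pixels inside the box
```

The box covers 12 pixels but only 8 of them are actually part of the object, which is why tasks that need exact shapes or sizes, like measuring a tumor, call for segmentation rather than detection.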
Computer vision tasks are applicable to higher-dimensional data than 2D images as well. This includes things like 3D medical imaging and video. Video comes with its own set of challenges, including the need for huge amounts of compute due to the large number of frames. One common task in the world of CV for video is object tracking, or identifying where an object is from frame to frame of a video. This is useful in similar contexts as object detection in images, with the added capability of being able to handle movement; e.g. identifying a weed in the field of view of a weeding robot, so as to precisely spray it with herbicide.
Computer vision is rapidly gaining abilities on tasks which involve deeper inference as well. One of particular interest is pose estimation, where a CV model infers the positions of a person’s limbs and joints, or those of a robotic arm’s components, from a photo or video. This already works quite well and comes in handy for things like Snapchat filters, where being able to reliably identify the parts of someone’s face is important for then applying fun effects to it! Another intriguing research direction has been that of 3D reconstruction, or inferring the 3D shapes of objects from 2D images. There’s still plenty to be done there, but simpler “3D from 2D” inference for estimating sizes and distances to objects has been powering things like assisted braking in cars for over a decade.
What’s particularly interesting about computer vision is that it can enable the use of commodity cameras and software (the computer vision model) in places where previously, more customized hardware or human labor would have been required. For example, a factory might install a camera at entryways that only admits workers whose faces are recognized as authorized employees and who are wearing an approved helmet. In this way, a camera and software could replace both ID card scanners and helmet checks. Many more such use cases have already been developed, and many more will be in the years to come.
Computer vision can also serve as a surprisingly universal sensor for novel hardware, like enabling robots to “feel” objects by visually measuring deformation in the membrane that’s in contact with the object in question.
Of course, the fixed costs of training computer vision models for real-world tasks are usually quite high, due to the expense of both hiring researchers and engineers and collecting and labeling data (unless you are lucky enough to be able to use existing data like Wikipedia, but then your competitors will be too!). Deploying computer vision models, like deploying other machine learning models, also comes with the nontrivial variable costs of adjusting models to individual customers’ data and needs. These economics will dictate where computer vision is deployed in the next couple of decades, but expect to see it invisibly powering a wide range of applications across our economy in the decades to come.
The wide deployment of effective computer vision means that computers can now “see,” but they can also now “hear.” Just like visual data, audio data is complex and unstructured, which made it hard to work with before the rise of effective deep learning. And just like with visual data, deep learning has made it dramatically easier to work with audio – even more so than with visual data, as audio is simpler.
This has powered the rise of a number of applications. Ten years ago, speech-to-text transcription was painfully inaccurate; now, Siri and Google Assistant are able to transcribe spoken commands and questions with surprisingly high accuracy. Working with music has changed completely as well. Ten years ago, Pandora suggested music to users based on an extensive database of hand-tagged information about songs. Now, Spotify combines that approach with algorithms that actually analyze the songs themselves with the help of deep learning to better match them to users’ tastes.
Audio, while less studied in the research community than vision, is a very interesting field for AI applications because it’s the medium for human speech. Speech is in many ways the easiest and most natural way that we as humans communicate. That’s exactly why many tech companies are creating new platforms for audio interfaces, from Alexa speakers to Apple AirPods. Expect to see many more applications in the years to come, and to be interacting with computers more and more by talking to them. Numerous startups and big companies are working on the innovations to enable this change, and I expect many more to join them.
Natural language processing
Alongside computer vision, the natural language processing (NLP) community is one of the most active in the world of deep learning. Broadly speaking, natural-language processing has to do with any task that primarily deals with human language, like when Siri answers questions or Google Translate translates text from one language to another.
You may be wondering at this point why deep learning is applicable to natural language. After all, the types of unstructured data we’ve discussed so far are very different from language. Images, videos, and audio are all easy to represent in vector form – that is, as a long list of numbers. To simplify a bit, an image is a long list of pixel (dot) colors; a video is a long list of images, and an audio clip is a long list of sound intensities. But what about natural language? Each language is composed of some finite list of root words, and indeed, it’s possible to approach some NLP problems by assigning an integer index to each word and then training a relatively simple ML model not based on deep learning to solve the task, using the word indices as input data. This approach has worked well for some problems, like email spam filtering, but fails to capture the complexities that human language can express.
But it turns out that there’s a way to represent words as vectors that allows more complex machine learning techniques, including deep learning, to work well with natural language and blow the performance of techniques that work directly with indexed words out of the water. That way is known as word embeddings. The principle is relatively simple: each word in a human language has some rate at which it co-occurs with every other word in the language, where a co-occurrence is when the two words are used with fewer than 5 (or some other small number) words between them, indicating some association between their meanings. For example, “apple” and “tree” have a high co-occurrence frequency in English, while “apple” and “tangent” have a much lower one. In this manner, it’s possible to pick a set of reference words – e.g. the 10,000 most common words in English, and a reference set of text – e.g. Wikipedia, and for each English word found in Wikipedia, count up the number of times it occurs within 5 words of each of the 10,000 reference words. This yields a 10,000-dimensional vector (list of numbers) of co-occurrence counts for every English word found in Wikipedia. The resulting list of word vectors can then be reduced to fewer than 10,000 dimensions – 100 is a common choice – without too much loss of information, in a manner similar to drawing a cube on a flat piece of paper. This then leaves us with a ~100 dimensional word vector for most words in the language.
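As a rough illustration of the counting step described above, here is a minimal Python sketch. The function name `cooccurrence_vectors`, the toy sentence, the two reference words, and the window of 4 are all assumptions chosen to keep the example tiny. A real pipeline would use a large body of text and thousands of reference words, followed by a dimensionality-reduction step that is not shown here.

```python
def cooccurrence_vectors(tokens, reference_words, window=5):
    """For each word, count how often each reference word appears
    within `window` positions of it (on either side)."""
    ref_index = {w: i for i, w in enumerate(reference_words)}
    counts = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word does not co-occur with itself
            neighbor = tokens[j]
            if neighbor in ref_index:
                # Words that never appear near a reference word get no entry.
                vec = counts.setdefault(word, [0] * len(reference_words))
                vec[ref_index[neighbor]] += 1
    return counts


# Toy corpus (an assumption for illustration; the essay's example uses Wikipedia).
tokens = ("the apple fell from the tree while far away "
          "a tangent touched the circle").split()
vectors = cooccurrence_vectors(tokens, reference_words=["tree", "circle"], window=4)

print(vectors["apple"])    # [1, 0]: near "tree", not near "circle"
print(vectors["tangent"])  # [0, 1]: near "circle", not near "tree"
```

Each word ends up with one count per reference word, so 10,000 reference words would yield the 10,000-dimensional vectors described above.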
These word vectors, also known as embeddings, have some very interesting properties. For one, words with similar meanings tend to cluster together in the 100 (or 50, or 200…) dimensional space they are embedded in such that “apple” and “orange” are closer together than “apple” and “knight”. This is because “apple” and “orange” are often found near the same words, like “tree” and “eat”, while “knight” is more likely to be found near “chess” and “horse”. Remarkably, analogies are often preserved in the n-dimensional space in which the word vectors live. For example, “king” and “queen” are likely to be separated by approximately the same distance, in the same direction, as “man” and “woman”. Finally, and most remarkably, it turns out that the shapes of the “clouds” of different languages’ word vectors are similar enough between languages that it’s possible to line them up and figure out quite accurately which word in e.g. French corresponds to “dog”, or “horse”, in English – just by lining up the clouds of all the words in English and French. In this way, it’s possible to translate between languages with some degree of accuracy with no examples whatsoever of translations between the languages! In the 19th century, it was necessary to find the Rosetta Stone to start deciphering Ancient Egyptian, but in 2020 it’s possible to decipher a hitherto unknown language using nothing more than a large body of text in that language alone – thanks to word embeddings.
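The clustering and analogy properties can be made concrete with a small sketch. The three-dimensional vectors below are invented by hand purely for illustration; they are not learned embeddings, and the helper names `cosine_similarity` and `analogy` are assumptions rather than any particular library's API. Cosine similarity is one common way to measure how close two vectors are.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (hypothetical). The dimensions loosely stand for
# (royalty, femaleness, fruitiness); real embeddings have no such labels.
toy_vectors = {
    "king":   [0.9, 0.1, 0.0],
    "queen":  [0.9, 0.9, 0.0],
    "man":    [0.1, 0.1, 0.0],
    "woman":  [0.1, 0.9, 0.0],
    "knight": [0.5, 0.1, 0.0],
    "apple":  [0.0, 0.1, 0.9],
    "orange": [0.0, 0.2, 0.8],
}


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by finding the word closest to
    vec(b) - vec(a) + vec(c), excluding the three input words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))


# "apple" is closer to "orange" than to "knight" in this toy space.
print(cosine_similarity(toy_vectors["apple"], toy_vectors["orange"]) >
      cosine_similarity(toy_vectors["apple"], toy_vectors["knight"]))  # True

print(analogy("man", "woman", "king", toy_vectors))  # queen
```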
That said, word embeddings aren’t just useful in and of themselves; they also enable the application of deep learning to human language by mapping words to vectors, which deep learning techniques use as inputs. Ironically, in this case, mapping somewhat-structured data (individual words) to a less-structured but more numerical form (word vectors) enables higher performance on most NLP tasks.
The range of these tasks is quite wide. One important NLP task is translation. While it’s possible to translate between languages using word vector cloud alignment as described above, or using the older methods that powered Google Translate for years, modern techniques based on deep learning have achieved much better levels of performance. Many other tasks are constantly being worked on by the research community, including question answering, which involves finding the answer to a question in a body of text, and sarcasm detection, which is exactly what it sounds like. A good sampler of tasks currently of interest to the research community is found in the SuperGLUE benchmark, which is used to test the performance of NLP models. A broader list that gives a good overview of many NLP tasks is on this Wikipedia page.
The capabilities of NLP techniques have seen incredible progress over the past two years, more so than any other field of machine learning. Much of this has been driven by the rise of effective transfer learning for NLP. Just like a computer vision model trained on a large number of photos of objects to classify those objects into categories can then be trained on a smaller number of photos of e.g. cats and dogs to tell cats from dogs, an NLP model trained on a general task can be fine-tuned to a more specific task as well. The more general task is typically some variation of predicting missing words in a piece of text. The more specific task could be question answering, sarcasm detection, or any of a number of others.
The trick is telling which tasks are well within the abilities of modern NLP, which tasks are on the horizon, and which are far in the future. Unfortunately, this is a difficult challenge, as human language itself is capable of expressing both very simple and very complex things.
Simpler NLP tasks which rely on surface-level language features are now generally approachable with a high degree of accuracy. For example, Grammarly has built a great business by making software that corrects word choice and sentence structure in a much more sophisticated way than traditional autocorrect – with technology powered by deep learning.
However, tasks which rely on some understanding of the meaning of text are more complicated. The most important thing to remember when it comes to those tasks is that state-of-the-art NLP models only capture patterns of words, not deeper meanings. So, a good model will “know” that “Eiffel” and “Tower” are closely related, and even that “baguette” is likely to follow in a subsequent sentence. But it will be unable to tell whether the narrator is 10km, 100km, or 1000km from Lyon unless it has been trained on text specifically mentioning the distance from Paris to Lyon, as it does not have any true knowledge of the world. There is now quite a bit of research effort to address this problem and imbue NLP models with more explicit knowledge about the world, but these efforts have not yet changed the game.
Despite this shortcoming, the state-of-the-art NLP models of 2020 are astonishingly powerful. The catch is that for complex tasks which require modeling the meaning of language, a model’s accuracy on that task will depend tremendously on the amount of specific data it is trained on. For example, a chatbot trained to answer common customer questions on a website might be able to achieve very high accuracy on a predefined and small set of interactions, if it is trained on something like 100,000 examples. But attempting to answer all customer questions would yield a much worse rate of appropriate responses, almost certainly low enough to be unacceptable for use by a business. For this and many other tasks, there is likely to be a “long tail” of examples that are especially hard for NLP models to handle. Leaving those to humans or some other fallback while handling the more common examples automatically is a common strategy. This is what Amazon does with its customer support chat, where common queries are answered automatically but ones that the chatbot cannot answer confidently are routed to humans, and what Siri does by responding to only a small and fixed set of allowable questions and directing the rest to Google.
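The "handle the common cases, hand off the long tail" pattern can be sketched in a few lines. This is a hypothetical illustration, not a description of how any company's system actually works: `route_query`, the 0.9 threshold, and the keyword-based `toy_classifier` (a stand-in for a trained NLP model that returns an intent and a confidence score) are all assumptions.

```python
def route_query(query, classify, threshold=0.9):
    """Answer automatically only when the model is confident;
    otherwise hand the query to a human agent."""
    intent, confidence = classify(query)
    if confidence >= threshold:
        return ("bot", intent)
    return ("human", None)


def toy_classifier(query):
    """Keyword stand-in for a trained model (hypothetical).
    Returns (intent, confidence)."""
    text = query.lower()
    if "refund" in text:
        return ("refund_policy", 0.97)
    if "track" in text:
        return ("order_status", 0.95)
    return ("unknown", 0.30)  # long-tail query: low confidence


print(route_query("How do I get a refund?", toy_classifier))
# ('bot', 'refund_policy')
print(route_query("My parcel arrived singing opera", toy_classifier))
# ('human', None)
```

Raising the threshold sends more queries to humans but lowers the chance of an inappropriate automated answer, which is the trade-off a business would have to tune.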
Those sorts of approaches, with a strategy of biting off common but simple NLP tasks that currently consume significant human effort, or are cost-prohibitive to do with human labor entirely, and automating them while leaving the more complex tasks to humans, are starting to gain steam and are poised to have dramatic impact in industry in the years to come. There are many examples of successful use cases already. Gmail automatically suggests completions for sentences and canned replies to simple emails, but lets the user choose what to include in their message. Legal software lets lawyers and paralegals search through documents and identify notable clauses with much more power and precision than was possible with keyword-only search in the recent past. And reams of documents in fields like transportation and logistics are starting to be processed automatically for later reference by humans, thanks to the application of computer vision and NLP to document processing.
The recent advances made by NLP have both multiplied what it can do and reduced the difficulty of achieving high performance on many tasks. And the possible applications of these techniques are many. The channels that carry information between people are the sinews of the world economy, and people communicate in natural language, be that over the phone or electronic messaging. For the first time, we can work with that human-to-human information flow in scalable ways: reading an additional 100 articles about an industry per week used to mean hiring an additional analyst, and reading an additional 1000 used to mean hiring 10, but now, NLP summarization tools can summarize ten million articles almost as easily as they can summarize ten thousand.
Another extremely active area of research in the machine learning community, and one that has seen tremendous progress with the application of deep learning techniques, is the field of reinforcement learning (RL). Reinforcement learning is unlike other paradigms of machine learning in that it involves step-by-step decision making towards some goal. Some example tasks where an RL framework is relevant include playing video games and board games, and controlling robots and autonomous vehicles. In each case, there is a changeable state (e.g. the positions of pieces on a chessboard, or the pixels of an Atari game’s screen), possible actions that can be taken (legal moves in chess, or the steering input to an autonomous vehicle), and a “reward” that denotes a successful or unsuccessful outcome (e.g. a positive reward for winning a chess game, a negative reward for crashing a car) which is used as the signal to update the model.
The reward in RL is what frames the training process for the model. In the same way that a supervised learning model’s underlying functions are “nudged” when it makes mistakes on training examples to reduce similar mistakes in the future, an RL model’s underlying functions are “nudged” to be more likely to take actions that led to positive rewards, and less likely to take actions that led to negative rewards.
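To make the "nudging" idea concrete, here is a deliberately simplified sketch with no state at all: an agent repeatedly picks one of two actions, and the preference for the chosen action is moved up or down in proportion to the reward. This is an assumption-laden toy (the names `train` and `softmax_probs`, the learning rate, and the fixed rewards are all invented for illustration), not the update rule used by any particular deep RL system.

```python
import math
import random


def softmax_probs(prefs):
    """Turn preference scores into action probabilities."""
    exps = {a: math.exp(p) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def train(reward_fn, actions, steps=200, lr=0.1, seed=0):
    """Sample actions, observe rewards, and nudge preferences accordingly."""
    rng = random.Random(seed)
    prefs = {a: 0.0 for a in actions}
    for _ in range(steps):
        probs = softmax_probs(prefs)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        reward = reward_fn(action)
        # Positive reward makes the action more likely next time;
        # negative reward makes it less likely.
        prefs[action] += lr * reward
    return prefs


# Hypothetical environment: one action always helps, the other always hurts.
def toy_reward(action):
    return 1.0 if action == "good" else -1.0


prefs = train(toy_reward, actions=["good", "bad"])
probs = softmax_probs(prefs)
print(probs["good"] > probs["bad"])  # True
```

Real RL problems add a changing state and delayed rewards, which is a large part of why so many more iterations are needed in practice.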
Reinforcement learning has seen tremendous advances in recent years thanks to the heavy application of deep learning techniques, with modern approaches (deep RL) able to defeat humans at the board game Go (previously the last stronghold of human superiority over computers in popular board games) and score highly at many video games.
However, as with other domains, the performance of deep learning techniques in RL is chiefly limited by the availability of data. And more so than in other domains, collecting data for training RL algorithms is fiendishly difficult, as the step-by-step nature of the setting often requires millions of iterations of alternately applying the model to its target task and retraining it on the data gathered from those applications.
This means that in settings where running RL models and collecting data on their performance is cheap and fast – i.e. in settings that can exist entirely in a computer, like video games and board games – deep RL algorithms have shown tremendous gains in performance. But in settings where interaction with the real world is necessary for measuring the performance of models and collecting the data with which to improve their performance – like robotics and online education – the rate at which data can be collected is dramatically lower than in the fully virtual settings, and this has crippled the performance of deep RL in those settings. As a result, in settings like robotic control and autonomous vehicles, controls software is still not commonly powered by deep learning despite the power of those techniques in other areas.
There is now a significant push to close the gap between RL in virtual environments and RL in the real world, with one interesting direction being the use of simulation and its mapping to the real world, otherwise known as sim2real. For example, a deep RL algorithm for robotic control might be trained on a huge number of runs on a simulated robot, then fine-tuned on the real physical robot for thousands rather than millions of runs. These techniques show promise, but are still far from perfect. For example, OpenAI recently demonstrated a robotic hand that had been trained with a sim2real approach to solve a Rubik’s cube. Training the hand to work perfectly in simulation took them about 2 months, but getting it to then work in the real world took another 2 years, and the resulting performance was still extremely unreliable and only worked for configurations of the cube that required few moves to solve.
Another approach which is useful in addressing the data-inefficiency of reinforcement learning is known as imitation learning. Models trained by reinforcement learning typically blunder about for a long time after training starts, learning haphazardly to take actions that lead to positive rewards. Imitation learning instead trains models to choose actions that roughly imitate demonstrations of the task at hand, which are typically sourced from humans. One approach is to train a model with imitation learning on some number of human demonstrations (e.g. of people playing Tetris) and then to switch to trial-and-error reinforcement learning to further improve performance. This is apparently the approach taken by Pieter Abbeel’s startup Covariant AI, which is attempting to train robot arms to move objects between bins in warehouses – a very hot area right now.
The bet that Covariant and its many competitors are placing is that while “RL for the real world” is not working well enough for most applications yet, it is on the cusp of a breakthrough that will unlock many new applications and billions of dollars of value. Indeed, RL is showing new and impressive results on a regular basis, from the abovementioned Rubik’s Cube robot hand to Google’s recent result showing success in designing layouts for computer chips (a very complex task) with the help of RL.
When the performance of RL crosses into industrially-useful territory, it is likely to follow a pattern similar to that of NLP technology: deployment in partnership with humans. Just as a chatbot might be able to handle 50% of customer inquiries automatically and route the rest to human agents, a bin-picking robot for warehouses might be able to handle 99% of objects and require humans walking around the warehouse to handle the 1% that are the hardest to pick up.
It is still unclear whether RL will cross the line into widely-useful performance in one year or in ten, but when it does, expect many consequences. Not the least will be much more adaptable robots, which will mean a number of use cases outside the factories and warehouses where most robots are currently deployed. Roombas may not be the only robots people encounter in their day-to-day lives for long!
So far, we’ve discussed many ways in which deep learning models are able to process unstructured input data, whether it’s in the form of images, audio, video, text, or something else entirely. However, deep learning has enabled something else that’s quite remarkable: training models that output unstructured data.
What this means is that it’s now possible to train a model to generate new images, audio, text, etc. This can take the form of training a model on many images of people’s faces to generate new images of people who don’t exist at all. It can also take the form of training a model on Wikipedia articles to output plausible-seeming sentences and paragraphs. This works both in the non-conditional setting (“Given these millions of photos of people’s faces, come up with more that look similar”) and in the conditional setting (“Given these millions of paragraphs with accompanying audio clips of people reading them, learn to take a new paragraph and generate audio that sounds like someone reading it out loud”).
Just like NLP, generative models have seen tremendous advances in the last couple of years, and are now showing incredible results. It’s now possible to do things like generate extremely realistic images from nothing more than a text description or a sketch, or a video of a news anchor or politician giving a speech that they never gave at all (the controversial “deep fakes”), or to automatically generate captions for pictures and videos.
The potential downstream applications are many. One area that’s already seeing substantial application of generative models is computational photography. It used to be the case that cell phone cameras were much less capable than full-fledged DSLRs with large lenses. But it turns out that in many cases, the software on the phone can more than correct for the shortcomings of the hardware. Deep learning techniques like image denoising and image superresolution are able to dramatically increase the perceived quality of photos by inferring missing details, and companies like Google and Apple have already deployed these or similar techniques to their phones to enable things like professional-quality portraits and incredible low-light photography. One funny consequence may be reduced reliability of details in photos: for example, as more sophisticated algorithms are deployed to phones, a photo taken at dusk may look as good as one taken in the daytime, but the colors of objects may be slightly or even significantly wrong as the algorithm “guesses” the colors that there is not enough light to “see” properly.
Other artistic applications of generative models are sure to follow. Creative tools are likely to be some of the first areas to be disrupted. Imagine an alternative to Photoshop where instead of having to manually airbrush out a pole and fill in the background, there was a button to “remove object” in one click that did it automatically. Or imagine an animation program that could automatically design a scene based on text instructions and place characters in it, then have those characters perform actions and express emotions based on high-level commands – removing the need to manually design much at all. Or a music production program that could automatically adjust the style of a song to be more exciting, or more menacing, or add incredibly realistic instruments to accompany a singer dynamically. These tools are likely to start out with much less room for customization and lower-quality output than existing high-effort tools, but enable hobbyists (even children and teenagers!) to produce songs and movies that would have previously taken much more skill and labor. This could have as much of an effect on the landscape of content production as the advent of the Internet and YouTube, SoundCloud, etc. Eventually, I expect the classic disruptive technology cycle to take place and the high-leverage creative tools to make their way upmarket and take over more and more of the creative software space.
Recommender systems may well be the AI application that has already driven the most economic impact. Their purpose is to match content with users, whether that content is a video on YouTube, a movie on Netflix, or an ad on Google or Facebook.
Recommender systems are a key part of the “secret sauce” of a number of large companies, and often rely on proprietary data, so their development has mostly taken place in secret within the research labs of large companies, with little activity in academia. Thus, it’s hard for outsiders to tell just how capable they are, and what techniques are being used to advance the state of the art in the field. However, it’s safe to say from the incredible amounts of money made by Google, Facebook, Netflix and others that the systems work very well. It’s also a safe bet that deep learning has been having an impact on the field, whether by automatically analyzing the content of YouTube videos or even by enhancing the matching algorithms themselves.
Recommender systems are not the only “less sexy” under-the-radar place where modern AI can generate significant economic impact. Sophisticated quantitative finance firms like Renaissance and D. E. Shaw have likely been using deep learning to gain a competitive edge in the markets for much longer than those techniques have been in the public eye — likely since the 90s or even earlier.
But many other parts of the economy also rely on solving optimization problems to make money, from FedEx package routing to the operation of electrical grids. The optimization of these business problems, often referred to as operations research, has been the topic of study since at least the 1940s, and has met with great success. In many cases, well-established approaches from computer science and statistics, like linear programming, are more than adequate to find near-optimal solutions to business optimization problems.
But there are also cases where it will be possible to find better solutions to business optimization problems by finding complex patterns in high-dimensional data: exactly where deep learning excels. And even in cases where applying deep learning techniques can yield only a small edge over more traditional techniques, each percentage point of increased efficiency can be worth billions of dollars. As such, I expect that a number of companies are already hard at work in this space and that more will emerge in the years to come.
Alongside the places where deep learning-powered AI is truly driving huge progress, there are a number of realms where perception of the capabilities of AI far surpasses what it can actually do. Much of this is self-reinforcing: the more that everyone believes that AI is going to acquire incredible powers, the more they pay attention to it, and this creates incentives for the media to hype up the power of AI to draw clicks, and for businesspeople to hype up their usage of AI to draw media attention and investor dollars. Even AI researchers are often tempted to overstate the generality of their results to generate enthusiasm and win grants.
Interestingly, this cycle has played out a number of times before. Each time, a wave of progress in the genuine capabilities of AI excited public sentiment, which in turn incited a wave of overly enthusiastic promises from the media, business, and research communities. Eventually, expectations rose so high that after the technology inevitably failed to fulfill most of them, an “AI winter” followed, with enthusiasm and funding depressed for years.
I expect a similar dynamic to play out soon in public sentiment after AI fails to deliver most of the more outrageous things that have been promised over the past few years. However, I don’t expect a full “winter” to take place, as AI techniques are now creating incredible economic value as detailed in this essay, and that stream of money will continue to stimulate interest in AI in the business and research communities. Before long, we should reach a point where expectations of AI’s abilities are better matched to reality.
In the meantime, it’s best to treat the bolder claims around AI with a skeptical eye. Again, the best way to tell if a claim about a new AI application has a realistic chance of being true is to consider whether it lines up well with an existing research direction, or is similar enough that existing techniques that perform well are likely to generalize to it.
A few examples: real or hype?
Now, perhaps a few example claims are in order! Here, I’ll outline my opinions on a few that I’ve heard, either exactly or approximately, over the past couple of years. This list is of course far from exhaustive, but I hope it’s a useful sample of more and less realistic claims about what AI can do.
“We’re going to build an AI that can defeat humans at most video games.” This one seems likely! RL techniques are well-suited to learning high performance on video games, and that performance is rising quickly. Of course, AI currently performs much better on some games than others, and there are likely to be games where humans have an edge for the foreseeable future.
“We’re going to build an AI that will be able to drive anywhere a human can.” This claim is one where reality has fallen far short of expectations. Training AI to interact with the physical world turns out to be a hard and messy problem, and driving human passengers is an area where mistakes are particularly costly. Expect to see autonomous driving roll out only in limited ways in the near future – e.g. small, slow food delivery vehicles which are unlikely to hurt anyone in a crash and can be remote-controlled in the case of unexpected circumstances (another example of AI deployment in partnership with humans!); Nuro is working on something like this. Highway trucking in good weather is another task that’s easier than full autonomy, and will have tremendous economic impact when even partially automated.
“We’re going to build an AI that will find drugs to cure diseases.” In theory, finding small molecules or biologics to bind to a known target to affect the course of a disease is a problem that can be solved computationally. In practice, computational approaches to drug discovery are at best an aid to wet-lab work. Could that change? Absolutely, but I haven’t yet heard of any applications of deep learning that are redefining performance in drug discovery in the same way that they did in image classification. Gathering data in this space is hard, simulations of binding affinity are far from perfect, and there is so much known structure in the domain – the laws of physics and chemistry – that deep learning, which is particularly good at discovering hard-to-find structure in data, may not be the right tool for the job. Computational modeling of drug candidates will improve, possibly with some help from deep learning techniques, but it is unlikely to replace in vitro and in vivo testing entirely due to the complexity of biological systems. This will change only when we can simulate a human body down to the molecular level, something unlikely to happen for many decades. Update: Jonas Köhler has contributed a thoughtful post on Reddit disagreeing with some of my arguments here; I highly recommend taking a look!
“We’re going to build an AI that will teach students.” This one sounds great, and my hope of making progress in this direction is what led me to start a PhD at Stanford. However, it turns out that education is an incredibly complex and social task, and one that is tightly coupled with just about everything that makes us human. We as humans have evolved to be able to infer each other’s thoughts and feelings in ways that let us teach each other effectively as well. This, however, is far from the capabilities of AI right now: if even realistic chatbots are out of reach right now, what hope is there for an electronic Socrates? There is educational software now that is reaching millions of learners (think Duolingo and Anki), and adaptive algorithms do contribute to its performance. However, I haven’t seen any cases so far where deep learning techniques outperform simple adaptive strategies like moving students to a new topic after they get three questions in a row right on the previous one. It may be that with more and more students interacting with educational software, the vast volume of data will yield significantly improved algorithms, but that’s going to be a tough nut to crack at best.
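For a sense of just how simple the "three in a row" baseline mentioned above is, here is a minimal sketch. The function name `run_curriculum`, the topic names, and the answer sequence are hypothetical; the point is only that the whole strategy fits in a dozen lines and involves no learning at all.

```python
def run_curriculum(topics, answers, streak_needed=3):
    """Return the topic shown for each answered question, advancing to the
    next topic after `streak_needed` correct answers in a row."""
    topic_idx = 0
    streak = 0
    history = []
    for correct in answers:
        if topic_idx >= len(topics):
            break  # the student has finished every topic
        history.append(topics[topic_idx])
        streak = streak + 1 if correct else 0  # a wrong answer resets the streak
        if streak == streak_needed:
            topic_idx += 1
            streak = 0
    return history


topics = ["fractions", "decimals", "percentages"]
answers = [True, False, True, True, True, True, True, True]
print(run_curriculum(topics, answers))
# ['fractions', 'fractions', 'fractions', 'fractions', 'fractions',
#  'decimals', 'decimals', 'decimals']
```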
Artificial General Intelligence?
Finally, I would be remiss if I didn’t mention the debate around artificial general intelligence (AGI), also known as strong AI. AGI is essentially AI that can do everything a human can – and probably much more. Every time the state of AI advances, there tends to be alarm about the impending rise of AGI, like the famous example in the Terminator movies when the Skynet AI gains general intelligence and takes over the world. But in reality, we are almost certainly far from the advent of AGI. Human intelligence is still poorly understood, and so simulating it in a computer is not something that’s approachable directly.
It’s true that individual functions of the human brain, like visual and auditory perception, are now well-approximated by deep learning methods. But other faculties, like abstract reasoning, are still all-but-unapproachable by any ML techniques. Could this change, just as vision went from being unapproachable to almost trivial? Certainly, but that would require at least one and probably a number of breakthroughs in ML. And even if we learn to approximate all the functions of the human brain individually with ML, it will be some time before we stitch those programs together into a coherent whole and learn to train that model in a way that will teach it as much about the world as humans learn over a lifespan of decades. If AGI is gradually cobbled together from different advances in ML as it is now practiced, we will have plenty of warning, and the emergence is likely to be gradual enough that there is no single identifiable “before” and “after.”
It’s also possible to sidestep the complexities of duplicating the functions of human cognition one-by-one with ML and try to directly simulate a human brain neuron-by-neuron, but this too is a daunting problem. A brain is an incredibly complex system, not yet fully understood, and researchers have thus far been unable to simulate the nervous system of any creature more complex than a worm. There will surely be progress here as well, but here too it will be incremental. If there comes a day years or more likely decades from now when we build a supercomputer that successfully embodies a conscious human mind by simulating its physical structure, we will have hints that such a thing is in the works well before it actually happens, just like we would if existing approaches to ML gradually add up to full AGI.
Could we stumble into AGI in some way that doesn’t require decades of progress? The only way I can imagine this happening is if current deep learning methods turn out to be far more powerful than imagined. This isn’t entirely impossible: current methods have shown a remarkable ability to gain performance in areas like NLP from nothing more than constant increases in model size and computational power and data available for training the model. The company OpenAI claims to be pursuing exactly this strategy, apparently with a focus on reinforcement learning and similar approaches – the idea seems to be to expose increasingly huge RL models to increasingly complex real-world problems and see if AGI emerges.
However, it seems unlikely that these constant increases in power applied to existing techniques will yield a sudden breakthrough in performance that unlocks AGI. RL algorithms are so far incapable of so much as reliably solving a physical Rubik’s Cube. There seems to be no reason to believe that they will suddenly learn to do everything humans can. It is of course possible that this happens, and with that eventuality in mind it’s worth thinking through how best to manage the rise of AGI when it comes about – it will indeed be an incredibly powerful technology, maybe the most powerful ever in human history. But the day it comes into being is probably still decades away, if not longer.
In this time of quick progress in the capabilities of AI, it’s especially important to be judicious in distinguishing new capabilities of the technology – and the opportunities they open up – from marketing hype designed more to attract attention than to represent reality. Don’t be fooled by overly optimistic narratives, but also don’t assume that AI is all snake oil: there are many frontiers opening up ahead of us. For my fellow researchers and businesspeople reading this who will be opening those frontiers in the future, I wish you luck!
If you have any comments, please leave them below! I also welcome any messages at (My first name) @ (My last name).com – get in touch if you want to chat about AI, startups, or anything else that’s on your mind.
This year, I had a chance to attend NeurIPS, the most prominent conference in artificial intelligence and machine learning (AI/ML), to present a workshop paper. I’ve spent the past couple of years working on a combination of AI research in various subfields and tech startups and so have been following the evolution of AI with interest. This conference, bringing together as it does some of the best researchers and practitioners in the field, was a particularly good vantage point to gauge the state of, and changes in, how people are thinking about and using AI. Here, I’ve collected some of my impressions in the hopes that they might be useful to others. If you’re curious about other people’s perspectives, Andrey Kurenkov collected some links to various talks and key trends in his recent post, which is also worth a look.
The most overarching theme I noticed at NeurIPS was the maturation of deep learning as a set of techniques. Since AlexNet won the ImageNet challenge resoundingly in 2012 by applying deep learning to a contest previously dominated by classical computer vision, deep learning has attracted a very large share of the attention within the field of AI/ML. Since then, the efforts of countless researchers developing deep learning and applying it to various problems have accomplished things like beating humans at Go, training robotic hands to solve Rubik’s cubes, and transcribing speech with unprecedented accuracy. Successes like these have generated excitement both within the AI community and elsewhere, with the mainstream impression tending towards an overestimate of what AI can actually do, fueled by the more narrowly circumscribed successes of new, largely deep-learning powered, methods. (Gary Marcus has a great recent essay talking about this in more detail.)
However, a perspective that I find more useful than “the robots are coming” is the one I heard from Michael I. Jordan when he came to Stanford to give a talk in which he described modern machine learning as the emerging field of engineering which deals with data. Consistent with this perspective, I saw a number of lines of inquiry at NeurIPS which are developing the field into more nuanced directions than “Got a prediction problem? Throw a deep net at it.” I’ll break down my impressions into three general areas: making models more robust and generalizable for the real world, making models more efficient, and interesting and emerging applications. While I don’t claim that my impressions are a representative sample of the field as a whole, I hope they will prove useful nonetheless.
Robustness and generalizability
One prominent category of work that I saw at NeurIPS addressed real-world requirements for successfully deploying models beyond just high test-set accuracy. While a canonical case of a successful deep learning model, like an image classifier trained on the ImageNet dataset, is successful within its own domain, the real world in which models must be grounded and deployed is complex and ambiguous in ways which models must address if they are to be useful in practice.
One of these complexities is calibration: the ability of a model to estimate the confidence with which it makes predictions. For many real-world tasks, it’s necessary not only to have an argmax prediction, but to know how likely that prediction is to be accurate, so as to inform the weight given to that prediction in subsequent decision-making. A number of papers at NeurIPS addressed better approaches to this complexity.
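One common way to quantify calibration is the expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence with its observed accuracy. The sketch below is a plain-Python illustration of that idea. The equal-width binning, the function name, and the sample numbers are assumptions for the example, and it is not taken from any of the NeurIPS papers.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Expected calibration error using equal-width confidence bins.

    For each bin, take the gap between mean confidence and observed
    accuracy, then average the gaps weighted by the bin's share of
    all predictions.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append(i)

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: the model says 0.9 four times but is right
# only three times, and says 0.6 twice but is right only once.
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False, True, False]
print(round(expected_calibration_error(confs, hits), 3))
```

A perfectly calibrated model would score 0 here. The larger the number, the less the reported confidences can be trusted as probabilities in downstream decisions.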
Another complexity is ensuring that models are assigning appropriate importance to features which are semantically meaningful and generalizable, which in one way or another includes representation learning, interpretability and adversarial examples. A story I heard that illustrates the motivation for this line of research had its origins in a hospital, which had created a dataset of (if I remember correctly) chest X-ray images with associated labels of which patients had pneumonia and which did not. When researchers trained a model to predict the pneumonia labels, its out-of-sample performance was excellent. However, further digging revealed that in that hospital, patients likely to have pneumonia were sent to the “high-priority” X-ray machine, and lower-priority patients were sent to another machine entirely. It also emerged that the machines left characteristic visual signatures on the scans they generated and that the model had learned to use those signatures as the primary feature for its predictions, leading to predictions that were not based off of anything semantically relevant to pneumonia status and which would neither yield incremental useful information in the original hospital nor generalize in any way to other hospitals and machines.
This story is an example of a “clever Hans” moment, in which a model “cheats” by finding a quirk of the dataset it is trained on without learning anything meaningful and generalizable about the underlying task. I had a great conversation about this with Klaus-Robert Müller, whose paper on the phenomenon is well worth a read. I saw a number of other papers at NeurIPS dealing with interpretability of models, as well as representation learning, the related study of how models represent data. A notable subset of this work was in disentangled representations, an approach which aims to induce models to learn representations of data which are composed of meaningfully and/or usefully factorized components. An example would be a generative model of human faces which learns latent dimensions corresponding to hair color, emotion, etc., thus allowing better interpretability and control of the task.
A final direction attracting a significant amount of attention in the “what models learn” category was that of adversarial examples, which are data points which have semantically meaningful features corresponding to one category, but less semantically meaningful features which bias a model’s prediction in a different direction – for example, a photo that looks like a panda bear to humans but which contains noise that makes a model predict it to be a tree. Recent work in adversarial training has made progress in making models more resilient to such adversarial examples, and there were a number of papers at NeurIPS in this vein. I also had a very interesting conversation with Dimitris Tsipras, who was a coauthor on this paper, which found results which suggest that image classifiers may use some less-robust features for classification, which can be perturbed to generate adversarial examples without modifying the more robust features which humans primarily focus on. This is an emerging area of investigation and the literature is worth a closer look.
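The mechanics of an adversarial example are easy to see on a toy model. The sketch below uses the fast gradient sign method (FGSM), one well-known attack that I am swapping in for illustration, against a three-feature logistic regression. The weights, the input, and the step size are made up, and real attacks on image classifiers operate on thousands of pixels with far smaller per-pixel changes.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(w: List[float], b: float, x: List[float], y: int, eps: float) -> List[float]:
    """One fast-gradient-sign step that increases the loss on (x, y).

    For logistic regression with cross-entropy loss, the gradient of the
    loss with respect to the input is (p - y) * w, so each feature moves
    by exactly eps in the direction that hurts the model most.
    """
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.5, -0.2, 0.1], 1

x_adv = fgsm(w, b, x, y, eps=0.3)
print(predict(w, b, x) > 0.5)      # clean input is classified as class 1
print(predict(w, b, x_adv) > 0.5)  # perturbed input is no longer class 1
```

No feature moves by more than `eps`, yet the prediction flips, because the perturbation is aligned with the model's weights rather than with anything a human would consider meaningful.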
All in all, it appears that the community is spending considerable effort in making models more robust and generalizable for use in the real world, and I’m excited to see what further fruit this bears.
As the power and applicability of deep learning grows, we are seeing a transition of the field from the 0-to-1 phase, in which the most important results have to do with what is or is not possible at all, to a 1-to-n phase, in which tuning and optimizing the techniques previously found to be useful becomes more important. And just as the deep learning revolution had its underlying roots in the greater availability of compute and data, so too are the most prominent directions in this area which I saw at NeurIPS concerned with improving the data-efficiency and the computational efficiency of models.
Ultimately, deep learning depends on large amounts of data to be useful, but collecting this data and labeling it (for supervised approaches) are typically the most expensive and difficult stages of applying deep learning to a problem. A number of papers at NeurIPS had to do with reducing the severity of this issue. Many had to do with self-supervised learning, in which a model is trained to represent the underlying structure of a dataset by using implicit rather than explicit labels, e.g. predicting pixels of an image from neighboring pixels or predicting words in text from adjacent words. Another approach which a number of papers dealt with is semi-supervised learning, where models are trained on a combination of labeled and unlabeled data. And finally, weakly supervised learning has to do with learning models from imperfect labels, which are cheaper and easier to collect than perfect or almost perfect ones. Chris Ré’s group at Stanford, with their Snorkel project, are prominent in this area, and had at least one paper on weakly supervised learning at NeurIPS this year. This also falls under the “systems for ML” category, mentioned in the next section.
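As a rough illustration of weak supervision, the sketch below combines a few noisy, hand-written labeling heuristics by majority vote. This is a deliberate simplification: it does not use Snorkel's API, and Snorkel itself learns how much to trust each heuristic rather than simply counting votes. The heuristics, the label meanings (1 = complaint, 0 = not a complaint), and the sample texts are all hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

ABSTAIN = None  # a heuristic returns this when it has no opinion
LabelingFunction = Callable[[str], Optional[int]]


def lf_mentions_refund(text: str) -> Optional[int]:
    return 1 if "refund" in text.lower() else ABSTAIN


def lf_says_thanks(text: str) -> Optional[int]:
    return 0 if "thank" in text.lower() else ABSTAIN


def lf_many_exclamations(text: str) -> Optional[int]:
    return 1 if text.count("!") >= 2 else ABSTAIN


def majority_vote(text: str, lfs: Sequence[LabelingFunction]) -> Optional[int]:
    """Combine noisy heuristic votes into a single weak label.

    Returns ABSTAIN when no heuristic fires or when the top vote is tied.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between labels
    return counts[0][0]


lfs = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]
for text in [
    "I want a refund now!!",
    "Thanks for the quick help",
    "Where is my order?",
]:
    print(repr(text), "->", majority_vote(text, lfs))
```

The resulting labels are imperfect, but they cost a few lines of code instead of hours of annotation, and a downstream model can then be trained on them.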
Another prominent direction having to do with data efficiency (and also connected to representation learning) is that of meta/transfer/multi-task learning. Each of these approaches seeks to have models efficiently learn representations which are useful across tasks, thereby increasing the speed and data-efficiency with which new tasks can be tackled, up to and including one- or even zero-shot learning (learning a new task from a single example, or no examples at all). One interesting paper among many on these topics was this one, which introduces an approach to trading off regularization on cross-task vs. task-specific learning in the meta-learning setting.
Another direction in data efficiency which I noticed prominently at NeurIPS had to do with shaping the space within which models learn to better reflect the structure of the world within which they operate. This can broadly be thought of as “stronger priors” (although it seems the term “priors” itself is being used less frequently). Essentially, by constraining learning with some prior knowledge of how the world works, data can be used for learning more efficiently within this smaller space of possibilities. In this vein, I saw a couple of papers (here and here) improving models’ abilities to learn representations of the 3D world through approaches informed by the geometric structure of the world. I also saw a couple of papers (here and here, both from folks at Stanford) which use natural language to ground their representations of what they learn. This is an intriguing approach because we ourselves use natural language to ground and communicate our perception of the world, and forcing models to learn representations mediated by our languages in a sense imposes real-world priors upon the models. A final paper I’d mention in the category of priors as well is this one, which showed surprisingly good performance on MNIST of networks “trained” by architecture search alone – while this may not be immediately applicable, it is suggestive of the degree to which picking network architecture carefully (i.e. in a way that reflects the structure of a task) can make the learning process faster and cheaper.
One final direction relevant to data efficiency is that of privacy-aware learning. In some cases (and likely more to come in the future), data availability is bottlenecked by privacy constraints. A number of papers I saw, including many in the area of federated learning, dealt with how to learn from large amounts of data without compromising the privacy of the people or organizations from which the data originated.
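The core aggregation step in many federated learning setups is a weighted average of model parameters trained locally on each client, often called federated averaging. The sketch below shows only that step, with made-up weights and client sizes. Note that averaging on its own is not a privacy guarantee: real systems typically layer on techniques such as secure aggregation or differential privacy, which are not shown here.

```python
from typing import List, Sequence, Tuple


def federated_average(
    client_updates: Sequence[Tuple[List[float], int]],
) -> List[float]:
    """Average client weight vectors, weighted by local example count.

    Each entry is (weights, n_examples). Only the weights leave each
    client; the raw training data stays where it was collected.
    """
    total = sum(n for _, n in client_updates)
    if total == 0:
        raise ValueError("need at least one client with data")

    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, n in client_updates:
        if len(weights) != dim:
            raise ValueError("all clients must share the same model shape")
        for j, wj in enumerate(weights):
            averaged[j] += wj * n / total
    return averaged


# Two hypothetical clients: one with 100 local examples, one with 300.
updates = [([1.0, 2.0], 100), ([3.0, 4.0], 300)]
print(federated_average(updates))  # prints [2.5, 3.5]
```

The client with more data pulls the global model further toward its local solution, which is the intended behavior of the weighting.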
In addition to data efficiency, efficiency with regard to computational resources – i.e. compute and memory/storage – was also a prominent direction of many papers at NeurIPS. I saw a number of papers having to do with the compression of models and embeddings (the representation of the data used by models in certain settings). Shrinking models and embeddings/representations of data reduces both computational and storage requirements, allowing more “bang for the buck”. I also saw some interesting work in biologically-inspired neural networks, like this paper from Guru Raghavan at Caltech. One motivation in this area is that while there will be certain limits to how many matrix multiplications and additions can be performed per dollar/second on general-purpose hardware to push the capabilities of modern deep learning, it may be possible to use special-purpose hardware which more closely approximates the functions of biological neurons to achieve higher performance for certain tasks. I heard a combination of curiosity and skepticism around biologically-inspired approaches from fellow NeurIPS attendees: this is an area to watch for the 10+ year horizon.
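One of the simplest forms of embedding compression is quantization: storing each value as an 8-bit integer plus a shared scale instead of a 32-bit float. The sketch below shows symmetric int8 quantization of a single vector. It is a generic illustration with a made-up vector, not the method of any specific NeurIPS paper, and production systems use considerably more refined schemes.

```python
from typing import List, Tuple


def quantize_int8(vec: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in vec), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in vec], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the quantized representation."""
    return [qi * scale for qi in q]


# Hypothetical embedding vector.
embedding = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(embedding)
restored = dequantize(q, scale)

print(q)
print(max(abs(a - b) for a, b in zip(embedding, restored)))  # rounding error
```

Each value now needs one byte instead of four, at the cost of a rounding error of at most half the scale per entry.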
Directions and applications
Finally, while at NeurIPS I also found it very interesting to get a feel for the higher-level trends in various subfields of AI/ML and a feel for the different applications now possible, or becoming possible, thanks to recent advances in research. This section is more of a smorgasbord than a narrative; skip around as interest dictates.
Graph neural networks
One area which I should mention seeing a number of papers around is that of graph neural networks. These networks are able to more effectively represent data in settings with graph-like structure, but as I know very little about this direction personally, I’ll instead refer interested readers to the page of the NeurIPS workshop on graph representation learning as a starting point into the literature.
Reinforcement learning and contextual bandits
Another area in which I saw an absolutely tremendous amount of work was that of contextual bandits and reinforcement learning (RL). A few approaches which I saw a number of papers in were hierarchical RL (related to representation learning) and imitation learning (in a sense, setting priors for models through human demonstration). I also saw a number of papers dealing with long-horizon RL, in line with recent success in RL tasks requiring planning further into the future, e.g. the game Montezuma’s Revenge. A number of papers also had to do with transferring from simulation to the real world (sim2real), including OpenAI’s striking demonstration of teaching a robotic hand to solve a Rubik’s cube in the real world after training in-simulation. I also talked to Marvin Zhang from Berkeley about a paper he coauthored in which a robot was trained on videos of human demonstrations – “demonstration to real” rather than “simulation to real” learning.
However, it is important to note that in practice, RL for the real world, i.e. hardware/robotics, is still not quite there. RL has found great success in settings where the state of the problem is fully representable in software, like Atari games or board games like Go. However, generalizing to the much messier real world has proved more difficult – even the OpenAI team behind the Rubik’s cube project spent 3 months solving the problem in-simulation and then almost 2 years getting it to generalize to a real robotic hand with a real Rubik’s cube – and even then, with far less than 100% reliability. It will be interesting to see how quickly new approaches to RL can square the circle of generalizing to the real world. I had a great conversation with Kirill Polzounov and Lee Redden from Blue River about this – they presented a paper on a plugin they developed for OpenAI gym allowing people to quickly test RL algorithms on real-world hardware. I’m excited to see how quickly “RL for the real world” progresses – if we see an inflection point like the one vision hit in 2012, the implications for robotics could be tremendous.
Natural language processing
Another area worth mentioning is NLP (natural language processing), in which I’ve done some work personally. The Transformers/transferable language model revolution is still bearing fruit, with a number of papers showing good results leveraging those techniques. I was also intrigued by a paper that claimed unprecedented long-horizon performance for memory-augmented RNNs. It will be interesting to see if the pendulum swings from “attention is all you need” back to more traditional RNN approaches. It’s also worth noting that NLP is starting to hit its stride for real-world applications. I have a few friends and acquaintances working on startups in the field, including Brian Li of Compos.ai, whom I ran into at NeurIPS. I also enjoyed peeking into the workshop on document intelligence – it turns out NLP for the legal space is already a multi-billion dollar industry! Broadly speaking, natural language is the informational connective tissue of human society, and techniques to apply computational approaches to this buzzing web of information will only grow in the future.
Another area I’ll treat briefly, from personal ignorance rather than unimportance, is that of SysML – i.e. systems for ML and ML for systems. This is an exploding field, as evidenced by the numerous papers presented at NeurIPS and the workshops in the field. One particularly interesting talk was the one Jeff Dean gave at the ML for systems workshop – definitely worth a watch if you can find a recording (please leave a comment if you do). He and his team at Google managed to train a network to lay out ASICs much more quickly than human engineers could, and met or even surpassed the performance of ASICs laid out by humans. A number of other papers also showed compelling results in optimizing everything from memory allocation to detecting defective GPUs with the help of deep learning. A number of papers also addressed the “systems for ML” direction, such as the Snorkel paper mentioned above.
Generative models have reached a stage of significant maturity and are now being used as a tool for other directions as well as being a research direction in their own right. The performance of the models themselves is now incredible, with models like BigGAN having previously established a photorealistic state of the art for vision, and I saw a number of papers yielding unbelievably good results in conditional text-to-image generation, video-to-video mapping, audio generation, and more. I’ve been thinking about a number of downstream applications of these techniques, including some in the fashion industry and visual and musical creative tools, and I’m looking forward to seeing what emerges in industry in the years to come. Applications of generative models in other fields of machine learning have also been interesting, including fields like video compression – I talked to some folks from Netflix about this, as it may prove useful for reducing the bandwidth load on the Internet from video. (Netflix and YouTube alone use something like ⅔ of the bandwidth in the U.S.) Generative models are also being used in sim2real work in robotics, as previously mentioned.
Finally, for the sake of completeness I’ll mention a few more areas which I witnessed smaller bits of. Autonomous driving is still seeing a large and heterogeneous amount of work. It seems that we’re settling into a state of incremental improvement, where both research and deployment of self-driving are going to happen in fits and starts over the next several decades (e.g. local food delivery with slow, small vehicles and truck platooning are easier problems than autonomous taxis in cities, and will likely see more commercial progress sooner). On the other hand, deep learning for medical imaging appears to be maturing as a field, with numerous refinements and applications still emerging. Finally, I was also intrigued by a paper in deep learning for mixed integer programming (MIP). Traditional “operations research” style optimization like that which can be framed as MIP problems drives tremendous economic value in industry, and it will be interesting to see if deep learning proves to be useful alongside older techniques there as well.
Modern AI/ML, largely powered by deep learning, has exploded into a large and heterogeneous field. While there is some degree of unsubstantiated hype about its possibilities, there is also plenty of genuine value to be derived from the progress of the last 7+ years, and many promising directions to be explored as the field matures. I look forward to seeing what the next decade brings, both in research and in industrial applications.
Thanks to Shengjia Zhao and Isaac Sheets for helping edit this essay.