Weapons of Math Destruction in Hiring

About a month ago I read Dr. Cathy O’Neil’s book, Weapons of Math Destruction. It is a call to arms for data scientists and anyone in the automation field to handle their data responsibly. She lays out a framework for a code of conduct for people designing systems that handle people’s data, and shows how to avoid, or at least become aware of, the harm those systems cause society. Below is a draft of a speech I’ve prepared to give to my Toastmasters group.

Imagining weapons of mass destruction is something of an American pastime. The dangers are obvious – millions dead and large-scale infrastructure overwhelmed or simply annihilated. These effects grip modern culture, seeping into irresponsibly speculative news and defining Michael Bay and Tom Cruise movies. Fun, but not a useful discussion to have in your daily life. Instead, I want to cover the automated systems already harming us – what Dr. Cathy O’Neil terms “Weapons of Math Destruction.” A Weapon of Math Destruction is any scalable system that uses a model with the potential to harm society. In her book, Dr. O’Neil defines three types of WMDs: invisible systems, opaque systems, and obviously harmful systems. I’m going to cover each of these, giving illustrative examples from her book and from my experience in the hiring automation field.

Invisible Systems

First, invisible systems. Who here has been given a personality test while being considered for a job? How many of those employers told you that your response could automatically disqualify you? That’s an example of an invisible system – you don’t even know it’s there. On the surface, sure, no one wants to work with jerks, so just screen them out. The problem is two-fold. First, since you’re not aware of the system, you can’t appeal the decision and don’t know what went wrong. Second, “jerk” is not a well-defined term (and there aren’t any tests for it). In this case, employers misuse common personality tests (OCEAN, MBTI, etc.) for something they weren’t designed for: candidate filtering. In fact, these tests were designed to help teams work together by helping members understand what makes each individual tick.

This specific issue has a history dating back to discrimination against people with mental illnesses such as depression, PTSD, and anxiety. Unable to legally directly filter out people with mental illnesses, employers fall back to the poorly-correlated results of personality tests. Systems like these have obvious potential for abuse. Since they’re invisible, there is no public accountability and no way to correct the harm these practices cause.

Opaque Systems

But what if a system is visible, but the owner doesn’t want to reveal how it works? This is an opaque system. Opaque systems are very prevalent in software, especially in artificial intelligence development. There are a lot of startups that promise automated systems matching candidates to job openings, aiding or even eliminating the role of recruiters. On the surface this seems like a great idea – by making it easier for companies to hire people, it will be easier for people to get hired. What you’ll note is that none of these services reveal how they match candidates – it may be proprietary logic or a machine learning system that obfuscates the logic even from the company using it. Candidates who sign up for these systems know that there is a matching algorithm, but they aren’t let in on how it reasons about them. Since candidates don’t know the strengths and limitations of these systems, they can’t tailor their resumes or profiles. This is worsened further since recruiters usually have a limited understanding of the skills they’re hiring for, and can reason neither about the system they’re using nor about the skills they’re looking for. The system could be biased by race or gender, and the software’s developers may not even know.

By choosing to make their candidate matching systems opaque, these services discriminate arbitrarily against candidates who haven’t optimized their profiles for them, and there isn’t a way for candidates to learn how to improve their odds. The system is unappealable and outsiders can’t reason about how it behaves, so they are powerless.

Harmful Systems

But what if a system is visible and the implementers are transparent about how it works? You’re still not out of the woods. Many companies do a credit check before making a hiring decision. This system is visible and transparent – you know they are checking your credit history and that they have some minimum bar they’ll use to make a decision. Again, this initially seems like a good choice – if someone isn’t able to handle their finances responsibly, how can you expect them to be responsible with their job?

Financial irresponsibility isn’t the only way to end up with bad credit. Someone could steal your identity. A hospital might balance-bill you for tens of thousands over what your insurance covers. Maybe you’re still recovering from the financial crisis. But even if you are at fault for your financial history, systemically denying you a job will only make things worse for you and people like you. This is a simple feedback loop: people with bad credit get fewer and worse jobs, so their credit score gets worse. This is one of the many systems contributing to the cycle of poverty in the US.

Conclusion

The companies using and profiting from these systems – these WMDs – rarely examine their impact on the world, and are unlikely to share what they find if they do. As citizens we have to be aware that these types of automated systems exist and influence much of our lives. Invisible and opaque systems are unaccountable, and we have to push back, because usually we only learn they exist and hurt people once they’ve reached a monstrous size.

As the designers of systems, we have to make sure we aren’t falling into these pitfalls. If you’re making something with the potential to improve the lives of millions and have any sort of professional ethic, how can you live not knowing whether it is actually having that impact? Do you really take pride in your system if you’re not willing to let someone independently verify its effects?

These WMDs are already hurting you and the people around you. I’ve only given examples in hiring, but imagine the collective effect of thousands of these systems across every industry – real estate, medicine, finance – each impacting millions of lives. You probably participate in several, and may even be building one for work. Algorithms to automate systems will only become more prevalent with time. We have to be ready, and responsible.

System Design and Self-Selection Bias

Not all populations are created equal. Blindly designing a system without thinking about the pressures involved in the data you collect (or the people who will participate) can easily result in harm to society.

As an example, in a public online poll respondents are more likely to have strong opinions. Potential respondents who don’t have firm positions are less likely to see value in providing answers, and will be less likely to put effort into responding. Drawing conclusions from an online poll anyone can respond to will incorrectly lead to the finding that people are very polarized on issues. Scientific polls have safeguards to prevent this sort of bias.

Self-selection bias in system design isn’t always obvious, so I want to discuss a more nuanced case.

There’s a trend in the US for developing new financial instruments. Markets in the US are not well regulated, so it is very easy for entrepreneurs to develop new types of financial contracts as means of making money. One such case is instruments mimicking reverse mortgages. For example, Point lets homeowners sell a percentage of their home in return for cash.

Homes and Liquidity

In economics, a liquid asset is something people can exchange without the item losing value. In practice, this means anyone can easily determine the item’s value and exchange it easily. Cash is a liquid asset because its value is literally printed on the bill or coin and nearly anyone will accept it in exchange for goods immediately. Illiquid assets are the opposite – items for which this is difficult. For example, exchanging a home can take weeks, and it takes a professional hours to determine a fair value.

Point’s selling point is the option for homeowners to exchange a portion of an illiquid asset – ownership of their home – for a liquid one – cash. Fortunately, Point is honest that it may offer less for the portion of the home it buys. It may offer only $90,000 for 20% of a home which has been appraised at $500,000. In these agreements the homeowner’s net worth immediately decreases significantly – possibly by tens of thousands of dollars. (Just think of what would happen if the owner sold to Point, then immediately bought back that 20% in the case above.)
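To make the arithmetic concrete, here is a minimal sketch of that buy-back thought experiment (the function and its name are mine; the figures are just the example above, not Point’s actual terms):

```python
def net_worth_change(home_value, share_sold, cash_offered):
    """Immediate change in net worth from selling a share of a home:
    the owner gives up share_sold * home_value in equity and
    receives cash_offered in return."""
    equity_given_up = share_sold * home_value
    return cash_offered - equity_given_up

# The example above: selling 20% of a $500,000 home for $90,000.
change = net_worth_change(500_000, 0.20, 90_000)  # about -$10,000
```

Buying the 20% back at its appraised value would cost $100,000, locking in that loss.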

Self-Selection Pressures

In designing any system, we have to consider the pressures which will influence the statistical properties of the decisions it makes.

Aside: Homo Economicus and Homo Psychologicus

Homo economicus is a simplification of humans for economic modeling. It presents a human-like agent that acts rationally in its own self interest according to available information.

Homo psychologicus is a perturbation of homo economicus that takes into account the psychological factors all humans fall prey to. These models are more complex since they require more variables, but they should be used in situations where humans are less likely to act in their rational self-interest.

Homeowner Selection

In this case, we must first ask: What sort of homeowner is likely to accept an offer from Point?

This is where self-selection bias comes into play. In the long term, it’s obvious that statistically the better option is to hold onto full ownership. (If it were not, Point would not have a business model.) So if we assume an owner has a base level of financial savviness, they will hold on to full ownership if they have the financial means to do so. Thus, we can expect many participating homeowners to feel some pressure to get cash quickly. We can then assume they are less likely to be financially stable – more likely to have a credit payment or a bill they need to pay off quickly. They are not reducible to homo economicus but must be modeled as homo psychologicus. They will be more prone to mistakes in reasoning and more impacted by biases.

Offer Selection

Next: What sorts of offers is Point likely to make?

This is a second source of selection pressure – it impacts the statistics of the offers Point is likely to make. Success for Point’s valuation algorithm is the money Point makes, so if functioning optimally the algorithm will offer the lowest amount the owner will agree to. As a business Point is likely to be functioning close to homo economicus, so we can assume they will make this rational decision.

Aside: Weapon of Math Destruction

The logic which goes into this offer cannot be appealed, and there is no way for the owner to know how the value was calculated – it is hidden behind an unquestionable proprietary algorithm. This algorithm is unlikely to be “fair” to the homeowner – its purpose is to make money for Point. Simply by measuring success this way and not being auditable, the algorithm will be predatory (even if no humans involved had this intention). This unaccountability makes it a Weapon of Math Destruction.

Agreement Selection

Combining these two selection pressures lets us answer: What sort of agreements are likely to be made and accepted?


When dealing with an owner who is not acting rationally, Point can make offers below what a rational homeowner would accept.

Point is most likely to enter into agreements with owners who have undervalued their own home. The greater the disparity between an owner’s valuation and what Point really thinks the home is worth, the more incentive Point has to make a deal. If Point’s offer is far enough below what the owner believes the home is worth, the owner will reject it; but if the owner does not have a reasonable understanding of their home’s value, they are more likely to think the offer is a good one. Additionally, if the owner’s rationality is compromised, they are more likely to enter into a deal that is not in their best interests. Between two homo economicus, we would expect deals to be made only when both parties rationally perceive they will profit. Since many owners will not be acting purely rationally due to other factors in their lives, we can’t make this assumption. There will be owners who enter deals which hurt them.

The Sunk Cost Effect

By the time owners get to this stage, they have spent several hundred dollars getting their home appraised and hours of their time. Point, at most, has wasted some of its employees’ time. It’s worth it to Point, as it absorbs some of this cost by offering other homeowners lower prices. But for the owner, they now have an appraisal that they wouldn’t otherwise have. For many, the time, money, and effort they already expended will make them more likely to accept the offer even if it is below what they are comfortable with – this is the sunk cost effect. Since many potential participants were already under some pressure to get money quickly, their situation is now more dire, and they are even less likely to act rationally.

Conclusion

Should you use Point? That depends on your measure of success. If success is making a good return on investment, that depends on whether you can use the increased liquidity to make more than what you (effectively) paid Point and it is better than similar financial arrangements (e.g. reverse mortgage). If you are against systems that have the potential to harm society, you have to decide whether you trust Point has accounted for the damage it could do.

Point has the capacity to harm society if it isn’t careful. We can’t measure the impact on society since its valuation algorithms are hidden and Point is unlikely to share its data with researchers. Even if it cost them nothing to find out, it is likely they would choose not to know¹ whether their system caused harm. They would also be unlikely to share if they did know.

By default, the homeowners who self-select to enter agreements with Point will not be in a good financial situation, and so will generally lose net worth in the agreement. We can expect Point’s algorithm will move net worth from people with lower wealth to people with higher wealth – exacerbating the current wealth inequality problems our society faces. While it is possible for individuals to recoup the loss they incurred by purchasing the fast liquidity, this is not the default. If a homeowner has need of liquidity urgently, they aren’t likely to be using it in a way that will gain value (like investments) – they are more likely to need it for a large high-interest debt or unexpected bill.

It is important to consider these factors in any system being designed. It is possible Point has mitigated the issues I’ve described. This would require them to have a drive to actively ensure they aren’t being predatory of people in tough financial spots. This isn’t something done passively, but a professional responsibility they would have to choose to take. In the absence of evidence or any insight into their transactions and offer methodology, we simply can’t know.


  1.  *Economics for the Common Good*, by Jean Tirole, 2017, pp. 131-132 

Overnormalization

Why is it hard for computers to understand language?

This question plagues many a developer of NLP (Natural Language Processing) systems. While there are certain aspects of language we don’t know how to process yet, we often oversimplify language to make it easier for computers at the expense of preserving its meaning. This isn’t a problem of processing power, but a conceptual limitation of the humans designing these machines. There are many common mistakes that plague systems trying to interpret English, and they cause the sorts of problems that make users think computers will never really be able to interact on a human level.

An Aside: Normalization

Before processing text, most NLP systems normalize the text in some way to make it easier for the computer to understand. This may include steps like correcting obvious spelling mistakes, lowercasing all letters, and removing superfluous spacing. The idea is that none of these modifications really changes the meaning of the text, and there’s no need to develop a machine that can (and thus, has to) learn that extra spaces in the middle of a sentence rarely mean anything interesting.
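A minimal normalization pass of the kind described above might look like this (a sketch covering only lowercasing and whitespace; real systems also handle spelling correction, Unicode forms, punctuation, and more):

```python
import re

def normalize(text):
    """Lowercase the text and collapse superfluous whitespace."""
    text = text.lower()               # "For" -> "for"
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()

normalize("I  saw   It.")  # -> "i saw it."
```

Note that even this tiny example quietly destroys the capitalization of “It” – exactly the problem discussed next.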

Overnormalization: Ignoring Capitalization

Making all text lowercase before processing makes sense. For example, there’s obviously no significant difference between For (as it would appear at the start of a sentence) and for in the middle of one. By default, a machine would treat For and for as completely different words, not knowing they are closely related. By lowercasing all text, we eliminate this class of mistake. It also nearly halves the number of words the machine has to learn. This can be a massive help since most words rarely appear capitalized: if the machine first sees the capitalized form of a word in the wild (rather than in training), it will immediately connect it to the word it already understands. There might simply not be enough training data for a computer to learn that Lanthanide and lanthanide are the same word just from context. However, this simplification has unintended side effects.

Consider these three sentences and their intended meaning.

I saw it. = I saw [something referenced in another sentence].

I saw It. = I saw [the film It].

I saw IT. = I saw [the Information Technology department].

Any system that blindly lowercases everything will treat these sentences identically and appear comically inept. They have three completely different meanings and as humans we can see the distinction immediately. What’s worse is when low-level machine learning models are trained on them, and they are used to feed more sophisticated models.

A technique we use to teach computers relationships between words is word embedding – a type of model that can be thought of as a map containing every word: words closer to each other have “more similar” meanings than those far apart. The model “learns” from a bunch of sample sentences we feed it – usually hundreds of thousands. In this case, by lowercasing everything we told the machine that it, It, and IT all mean the same thing – they have the “same location”. This not only corrupts the computer’s understanding of those three words, but of anything even tangentially related to them. Words related to IT, like administrator and support, will incorrectly be considered similar to ones near it, such as that and this. Now if that faulty word embedding is used to train an even more complex model, it will compound the problems. Consider that there are literally hundreds of examples where capitalization matters in English, and there will be many bits of language the computer will have trouble understanding.

Solution: Variable Granularity

We appear to have competing requirements:

  • We want our machine to ignore differences in capitalization when they don’t matter.
  • We want our machine to pay attention to differences in capitalization when they do matter.

I suggest adding a third requirement, one that suggests a solution.

  • We don’t want to have to tell our machine each case where capitalization matters.

We could literally enumerate every instance where people capitalize words in non-standard ways, but that isn’t practical and the system won’t automatically figure out new instances. If we could automatically detect when capitalization mattered, the first two requirements would become non-issues.

A word embedding needs about one hundred example usages to “learn” what a word means and use it as an anchor point for understanding similar words. educated may appear in many hundreds of sentences while erudite appears in only a few, but erudite will show up in contexts similar enough to educated that the machine will figure out the words are very similar. We can leverage this limitation by declaring a threshold: if a word appears fewer than 100 times, we accept that the machine will sometimes make mistakes with it.

We can turn this threshold into a rule that determines whether to create a word-embedding entry for an exact capitalization:

  • If the exact capitalization occurs 100 or more times, make an entry for it.
  • If the exact capitalization occurs fewer than 100 times, use the entry for the most common capitalization (or create one if it does not exist).

Suppose we start with very little data:

Word  Appearances  Entry
rest  5            rest
REST  3            rest
Rest  1            rest
reST  1            rest

Even though most of these do have significantly different meanings, there wouldn’t be enough information for the computer to figure out the difference. Now suppose we collect significantly more data.

Word  Appearances  Entry
rest  500          rest
REST  200          REST
Rest  10           rest
reST  5            rest


There is now ample data for the machine to see that rest and REST are used very differently. Both should have their own entries in the word embedding. Until Rest and reST have enough training examples, they will be grouped under a default – probably rest, as it is the most common. While this correctly treats Rest as identical to rest, it still incorrectly groups reST with them.

This method may initially seem to fail for highly common words:

Word Appearances Entry
the   10,000       the   
The   1,000        The   

In this case, the logic will unnecessarily create entries for both the and The (I would argue there are meaningful distinctions, but that discussion would be its own post). This behavior only impacts incredibly frequent words, but since those words are so frequent, the machine will have enough information to learn that they are very similar. Processing time will be several percent slower since this increases the vocabulary size by several thousand words, but when we’re dealing with hundreds of thousands of unique words this isn’t a major issue.

The main limitation of this algorithm is that if there is little data for a given capitalization, the machine will automatically assign it the meaning of the most common capitalization. However, this is obviously better than the behavior of most systems now which unconditionally assign all capitalizations the same meaning.
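The rule above can be sketched in a few lines (the function name and the grouping-by-lowercase approach are my own; a production vocabulary builder would be more involved):

```python
from collections import defaultdict

def vocab_entries(counts, threshold=100):
    """Map each exact capitalization to the embedding entry it should use.

    counts: surface form -> number of occurrences in the training data.
    A form seen at least `threshold` times gets its own entry; rarer
    forms fall back to the most common capitalization of the same word.
    """
    by_word = defaultdict(dict)
    for form, n in counts.items():
        by_word[form.lower()][form] = n

    entries = {}
    for variants in by_word.values():
        fallback = max(variants, key=variants.get)  # most common form
        for form, n in variants.items():
            entries[form] = form if n >= threshold else fallback
    return entries

vocab_entries({"rest": 500, "REST": 200, "Rest": 10, "reST": 5})
# -> {"rest": "rest", "REST": "REST", "Rest": "rest", "reST": "rest"}
```

This reproduces the second table above: rest and REST earn their own entries, while Rest and reST fall back to rest.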

Conclusion

Understanding language is hard. Taking shortcuts can still produce cool results, but introduces additional limitations to anything that depends on it. Approximations make things easier – remember the spherical cow from Physics – but they always produce imperfect models. In the case of capitalization and language processing, getting rid of this approximation is relatively straightforward and we can immediately realize benefits while only paying a small computational cost.

In any case, as a layperson or someone consuming a product that promises “natural language understanding”, be aware that these approximations (and their associated problems) exist, and consider the harm that could be caused by neglecting them.

How to Make Your Resume Searchable

I previously covered some of the systemic problems that exist in recruiting today. In that post, I mentioned that one of the first steps in the recruiting process is a candidate search engine that analyzes many millions of candidate resumes and professional profiles to find ones matching a given set of criteria.

Whether you like it or not, any resume you send to a recruiter gets packaged with millions of others and sold. The same goes for any site that lets you build a professional profile. So it’s already likely you’re in many databases that recruiters pay to have access to and search, and someone is making money on what you’ve written. Here’s how you can make the best of it.

A Note on Text Search

The three main audiences of your resume.

Two of the (many, many) rules text retrieval systems usually follow when ranking documents are:

  1. How rare are the terms being searched?
  2. How often do the terms appear in the documents relative to their size?

The first means that if your resume has rare terms and someone searches for those same rare terms, your resume will be boosted higher. Rarer terms – specifically, those appearing in fewer documents – are weighted more heavily. So if a searcher types “database” (more common) and “MongoDB” (less common), the engine will rank a resume mentioning only “MongoDB” higher than one mentioning only “database”.

The second means that longer documents are penalized if they don’t use the term often. This is usually calculated with word count – so a 100-word resume mentioning “MongoDB” once will score higher than a 1,000-word resume mentioning it once, but the same as a 1,000-word resume that mentions “MongoDB” ten times.
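These two rules are the heart of classic TF-IDF-style ranking. A toy version (my own sketch, not any particular vendor’s formula):

```python
import math

def score(query_terms, doc_words, doc_counts, n_docs):
    """Toy relevance score: term frequency normalized by document
    length (rule 2), weighted by rarity across the collection (rule 1).

    doc_counts maps a term to the number of documents containing it.
    """
    words = [w.lower() for w in doc_words]
    total = 0.0
    for term in query_terms:
        term = term.lower()
        tf = words.count(term) / len(words)
        idf = math.log(n_docs / (1 + doc_counts.get(term, 0)))
        total += tf * idf
    return total

# A 100-word resume mentioning "MongoDB" once outranks a 1,000-word
# resume mentioning it once, and ties one mentioning it ten times.
short = ["MongoDB"] + ["filler"] * 99
long_once = ["MongoDB"] + ["filler"] * 999
long_ten = ["MongoDB"] * 10 + ["filler"] * 990
doc_counts = {"mongodb": 50}
```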

The combination of these two means you want to be as concise as possible in the smaller text fields of your profile, like “degree” and “job title”. The more words you put in them, the lower your profile will be ranked.

Recall that you are writing your resume or professional profile for three completely different readers:

  1. Search engines
  2. Sourcers
  3. Hiring managers

In this post I’m just focusing on (1). It may be a good follow-up article for me to muse on how to handle writing for groups (2) and (3).

Check Your Spelling

It should be obvious that computers aren’t great at guessing what you mean when you misspell things like names and titles. Until someone tells the search engine otherwise, “software” and “softward” are completely unrelated, even though a human understands the typo immediately. If you don’t spell things correctly on your resume, it will not be searchable. I’ve encountered terrifyingly large numbers of misspellings in degrees – literally over 100 different ways of misspelling “bachelor”. I haven’t decided whether the ten common misspellings of “doctorate” are more worrying.

Degree

Use the full name of the degree – no abbreviations. Think “Bachelor of Science in Computer Science”, not “BSCS”. There are many uncommon abbreviations that recruiters simply will not know and won’t bother to look up. My favorite is “EDM” which is a “Master of Education”, not “Electronic Dance Music”. Further, they aren’t going to include abbreviations they don’t know in their profile searches. And do be sure to include the level of education; resumes simply listing “CS” as the earned degree could mean many different things – you won’t get the benefit of the doubt.

Avoid including explanatory text like “Earned a BA in Communications” or “BA in Communications and a Minor in Interdisciplinary Studies”. The “earned a” is superfluous, and if the minor really is important then it should be listed as a separate degree. For real though, only the hiring manager is likely to care about your minor, and if at all only a very slight amount.

If you are interested in working in an English-speaking country, don’t go for the fancier-sounding “baccalaureate” over “bachelor”. You are even more likely to misspell it. I’ve seen cases where applicants even went as far as including the accent, but in the wrong place. If the search engine isn’t using a text normalization technique (e.g. one that removes accents), you’re out of luck. Similarly, English-native recruiters rarely think to type “baccalaureate”, so if you go that route you are unlikely to appear in results anyway.

School Name

As a best practice, use the name of the university on the institution’s LinkedIn profile. Not “Cambridge”, but “Cambridge University”. In this case “university” distinguishes you from “Cambridge College” graduates. Not “MIT” but “Massachusetts Institute of Technology”. And, dear god, never use something like “U of M”; there’s no way to figure out what school you are actually claiming to have attended.

Avoid including the name of the specific college within your university that you attended. If you went to Berkeley and got a business degree, that is the same as saying you went to the Haas School of Business. Candidates are very inconsistent with how they include school names, with everything from “Haas Berkeley” to “University of California, at Berkeley, the Haas School of Business”. It’s just difficult for the search engine to know that your profile should be ranked the same as the person who typed “University of California at Berkeley”.

Don’t combine multiple schools you went to in one line or entry. If you write “Harvard, Stanford, UCLA” as the name of the school you attended, you run the danger of not being found when someone searches for any of those schools. At best, your score will be one-third that of other candidates.

Company Name

Much of the advice from the School Name section applies here.

Use your employer’s name as listed on the organization’s LinkedIn profile. If you worked for a specific well-known division or product of your company (e.g. “Walmart Labs” or “YouTube”), use it instead. Otherwise, mention the division or product in your job description.

Again, avoid explanatory text. Mentioning “internship” is accurate, but will just cause you to be ranked lower. By all means, include it in your job title, but the company name field is not the place. For “Contractor”, the best place for this is the job description.

Job Title

You can put whatever job title you want on your resume. Your resume isn’t something for former employers to check, it is how you are presenting yourself to future employers. Obviously don’t lie or misrepresent yourself, but feel free to choose a common synonymous title over the specific one you may have been assigned. I’ve seen too many cases like “Integrated Data Network Engineer Level IV” who will never be found among the deluge of “Network Engineer”s.

If you are a contractor, the most common practice (of many, many different practices) is to append “(Contractor)” to the job title. This is up-front and honest, but I feel it unfairly penalizes contractors in searches. Beginning your job description with “Contract work for …” is fine, and makes it more likely you’ll be seen.

Search O*NET OnLine to get ideas of common job titles in your field. If you’re willing to do the full legwork, look for the occupation whose description most closely matches your functions in the Standard Occupational Classification, and either use one of the example titles directly or take it back to O*NET as inspiration for a search.

Don’t use meaningless job titles. Many people list things like “Specialist”, “Summer Intern”, or “Assistant” as their title and there’s no way to know what they mean. If you are a specialist, say what you specialized in. If you were an intern, be complete and say you were a “Software Developer Intern”.

Conclusion

Much of this isn’t obvious advice. We’re dealing with imperfect search systems that weren’t designed with resumes in mind. If you were writing only for humans proficient in your job functions, your resume would look very different. But this is the system we have for now, and you’re part of it whether you want to be or not.

Do remember that you can always send recruiters and hiring managers an updated resume when they contact you.

What’s Wrong with Recruiting?

Have you ever received an unsolicited message from a recruiter about a position you’re not interested in? Do you ever get passed up for positions you are qualified for before you’ve even interviewed? There are reasons for that, and they kinda suck.

I’ve worked in the HR automation field for about a year now, and this is what I’ve seen:

  1. Sourcers often aren’t familiar with the jargon of the positions they’re hiring for.
  2. Search engines aren’t good at ranking candidate profiles.
  3. Candidates don’t know how to write their profiles to make them easily searchable.

An entity relationship diagram of candidate sourcing, with the three relationships that cause the most pain in red.

 

Sourcers

Sourcers are the people who look for candidates who are both qualified for open positions and are interested in filling them. The people who do sourcing often have the job title “recruiter”.

The process of finding and hiring a new employee can roughly be broken down into the below steps. (Depending on the company, steps may be condensed or done by the same person.)

  1. A manager tells a hiring manager that they need someone with some set of qualifications/skills.
  2. The hiring manager writes a job requisition.
  3. A sourcer reads the job requisition and looks for candidates who meet the qualifications.
  4. The sourcer contacts the candidates, verifies their interest, and refers them to the hiring manager.
  5. The hiring manager / team / etc. interviews the candidate.
  6. The candidate is hired.

The issue here is with step 3. Sourcers generally have very little training, learning most of their trade on the job. They also hire for many, many different positions. The same sourcer may look for accountants, software developers, and mid-level managers. There’s so much jargon and so many different skills to juggle that sourcers rarely get much more than a superficial understanding of what they’re looking for. They don’t have the time to absorb the overwhelming amount of information in every profession. (Hiring managers, luckily, tend to specialize and pick up on the nuances of what they’re looking for.)

A recruiter sourcing their first “Database Architect”, for example, may discard a candidate with a decade of “NoSQL” and “Data Modeling” experience simply because the terms are unfamiliar. This causes problems for candidates whose jargon is too specific – they may be passed over because their resume or professional profile isn’t comprehensible to a layperson.

Candidate Search Engines

Sourcers usually go after passive candidates. As opposed to active candidates, who are currently looking for a new position and applying to openings, passive candidates are waiting for opportunities to come their way. The set of potential passive candidates is in the many millions – it’s essentially the entire workforce. There’s no way a human can look through this set for every opening.

Fortunately, sourcers have tools that make it easy to filter down the candidate pool and sort candidates by different criteria. LinkedIn Recruiter, for example, gives recruiters a search engine for everyone on LinkedIn (it’s one way LinkedIn makes money on your professional profile). Like many other such tools, it gives sourcers the option to look for candidates with specific job titles, skills listed, and degrees. On the surface, this sounds great.

In the real world, people are messy. For the most part, these are text searches of what candidates have typed. Did you type “Master of Business Administration” while the sourcer searched for “MBA”? Tough luck. Did you say “MySQL” when the sourcer searched “SQL”? No dice. Are you a CPA but the recruiter typed “Accountant”? Nope. Are you a “Software Engineer” but the sourcer typed “Software Developer”? Unless you type what a sourcer thinks to type in their search engine, your profile won’t appear.
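The gap between what candidates type and what sourcers search is easy to demonstrate. Below is a toy sketch of the difference between the exact-text matching many sourcing tools effectively do and a search that expands known synonyms first. The synonym table, function names, and profiles are all invented for illustration.

```python
# Hypothetical illustration of exact-match search vs. synonym-expanded search.
# The synonym table and profiles below are made up for this example.

SYNONYMS = {
    "mba": {"mba", "master of business administration"},
    "software developer": {"software developer", "software engineer"},
}

def naive_search(profiles, term):
    """What many sourcing tools effectively do: exact substring match."""
    return [p for p in profiles if term.lower() in p.lower()]

def synonym_search(profiles, term):
    """Expand the query with known synonyms before matching."""
    variants = SYNONYMS.get(term.lower(), {term.lower()})
    return [p for p in profiles
            if any(v in p.lower() for v in variants)]

profiles = [
    "Jane Doe - Master of Business Administration, 10 yrs finance",
    "John Roe - MBA, operations",
]

print(naive_search(profiles, "MBA"))    # misses Jane entirely
print(synonym_search(profiles, "MBA"))  # finds both candidates
```

The naive search silently drops the more experienced candidate; nothing in the sourcer’s results even hints that she exists.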

A common response sourcers have to this is to construct terrifyingly elaborate boolean queries containing hundreds of variations on titles and skills. Sourcers sometimes share parts of their queries, and some don’t even know how parts of their own queries work. If part of a query breaks, it may take hours or days to find and fix the problem.

Candidate Profiles

The above problems are unintuitive. There’s no way for a candidate to know that sourcers are (1) passing them over for using jargon more specific than the sourcer knows or (2) just not seeing them because they don’t match their search queries.

Candidates are startlingly diverse in listing their qualifications on resumes. Even seemingly-limited fields like “degree” may have tens of thousands of variations. Once you get to a more varied field like “educational institution”, you can end up with tens of variations for the name of a single university! This isn’t just misspellings: many people include the college within the university, their major, acronyms, and explanatory text. Job titles are an order of magnitude worse (I measured it).

The candidates who receive the most unsolicited messages about job openings are simply those who have typed what the average sourcer thinks to look for. I should do a follow-up article on advice for specific fields in a resume or job profile, but the gist is to keep in mind that your resume has to be general enough that someone unfamiliar with your role could do a search and find you, but specific enough (e.g. in job experience descriptions) that it piques the hiring manager’s interest.

AIDP: Choose the Right Level of Abstraction – Part 1

Continuing my AI Design Principles series. While this post doesn’t specifically reference AI design, it opens a discussion that will continue in Part 2.

There’s a common challenge in system design in general that applies strongly to AI design – making sure that your system solves the problem at the correct level of abstraction. Broadly, this can be posed as the question “does the system let me communicate my problem in the same way I think about it?”

Imagine the interface for driving a car:

  • the steering wheel
  • gas pedal
  • brake


This is exactly the correct level of abstraction – each component has distinct uses that are mostly orthogonal and the controls correspond to how we think of driving, e.g.:

  • “turn left” -> rotate steering wheel counter-clockwise
  • “go faster” -> press down on gas pedal
  • “make a sharp right turn” -> use brake to slow down to appropriate speed and turn steering wheel clockwise

Level of Abstraction too Low

This is the mistake I see most often in system design – making controls too granular.

Imagine if instead of one steering wheel you had four, one for each wheel. This would be madness and unnecessary for most people. While you would (technically) have the ability to steer more precisely than with just the one steering wheel, most people would not intuitively know how to use the controls to steer the car. For example, you can’t just rotate all of the wheels the same way – that would change the direction of the car but not actually change the direction it was pointed.

Designers make this mistake when they design too much for super-users. While super-users exist (and may be very loud), in most cases they aren’t the bulk of your users. Adding the ability to tune everything often comes at the expense of new users – it adds a steep learning curve and usually mandates setup before the tool can be used at all.

I recognize this one when I think “I know it’s possible to communicate what I want to the system, but I have no idea how.”

Level of Abstraction too High

This is the opposite case – where individual controls try to do so many things that common activities aren’t possible. I see this less often, but it’s no less a hindrance.

Now imagine if your steering wheel controlled both direction and acceleration – with the acceleration lower the more you turn the wheel. The designers were probably thinking “When I’m not turning I want to go faster, and when I’m turning I want to go slower, so let’s tie those things together!” They neglected the obvious (to the user) case of stoplights and traffic.

This mistake is more often made when designers don’t go through the legwork of requirements development. I’ve had the experience more than once of sitting in a room where people are happy to hallucinate the problems of users and avoid actually talking to them. If you only understand a subset of the problems users have, then your solution will be incomplete.

As a user, this one is characterized by thoughts of “It isn’t possible for me to communicate what I want using the controls given to me.”

If Users Think at the Wrong Level of Abstraction

One danger is if users have only worked with solutions at the wrong level of abstraction. They may have been trained to think at the wrong level, and in requirements development your job is to divine that.

Suppose you are designing an app that lets people make artsy customized tables, and all of the competing apps require users to create and upload .svg files of the shape and size of the table top. When you go to users, they talk about streamlining the .svg upload controls to make things like scaling easier. Another common complaint is that it is too hard to find exactly the coloration pattern they want – the selection other apps offer isn’t big enough. You ask several users to get a sense of the sorts of tables they make, finding that most go with rectangular table tops but vary wildly in coloration.

It strikes you that it should be easier to tell the system “I want a rectangular table top with these dimensions” since it is such a common use case. The bit of intuition here comes from looking for patterns in how users use the tool. When a user thinks “I want a rectangular table top”, they aren’t thinking “I want a table shaped like this .svg file”, even though that may be what they literally say. The users are thinking at too low a level.

On the other hand, you notice that the table coloration patterns users want varies so wildly that each requested pattern would essentially just be used by one person. As many of the users are artists themselves, they show you pictures they’ve drawn of their dream table coloration but have no way of telling the apps (or, for that matter, finding a close one in the thousands of different patterns available). In this case the users are thinking at too high a level of abstraction – there really needs to be a way for them to just upload their own designs as the coloration pattern.

Choosing the Right Level of Abstraction

This requires actually talking to the people whose problem you’re solving – requirements development. The point – too often overlooked – is to figure out (1) what problems your users have and (2) how your users think about those problems. They’re not the system designer – you are – so of course you won’t usually be able to literally use their suggestions. That doesn’t make what they say any less valuable.

When I go through this part of the design process, I ask myself these questions about the system I’m designing. Does the system:

  1. allow users to solve most of the problems they have?
  2. communicate possibilities in the same language users use to think about their problems?
  3. make it easy to solve problems that are easy to think about?
  4. allow for too many nonsensical inputs?

In the car example where the level of abstraction was too low, it is easy to think “I want to turn left” but difficult to communicate that to the system through the four steering wheels. This violates (2) and (3). It also lets the user do many things that don’t make sense – like turning the left wheels to the right and the right wheels to the left, violating (4).

These aren’t easy problems to solve. Choosing a good level of abstraction requires a mix of talking to people, thinking about how they perceive their problems, and being aware of the larger context influencing how people talk about what they need.

AI Design Principles: Choosing the Right Problem – Part 2

Part 2: Begin with a Decision that Many People Make Often, and Make Quickly

You’ll shoot yourself in the foot if you try to solve the sort of problem only a 10th-level wizard specializing in conjuration makes the first Tuesday of every prime-numbered year. Decisions made rarely or by few tend to be very difficult or to have little generalizable utility.

  1. What is the ideal flux density for each individual magnet in a particle collider with maximum experimental resolution around the 125 GeV range?
  2. Is Dragon’s Egg or Mission of Gravity a better first novel to study in a Hard Science Fiction class?

Sure, it might be fun to build an AI that could actually solve these problems, but for now it’s much more efficient to leave rare problems to humans. Remember – we spend most of our time mired in decisions everyone makes.

Begin with a decision problem where

  • it takes less than a minute to make the decision
  • lots of people make this sort of decision, and
  • people who make this decision tend to do it often.

This is really a litmus test for deciding whether a problem meets the requirements mentioned in Part 1. A problem that passes this test will satisfy many of the requirements.

Choose a decision problem where it takes less than a minute for a person to make this decision.

It should take less than a minute to make this sort of decision. At one minute per decision, you can only manually check about 500 samples per day; beyond that you lose the ability to reasonably verify the correctness of your results by hand – a problem if your requirements call for over 90% accuracy. If you don’t already have a lot of labeled data, the return on time invested in labeling slow-to-label samples usually isn’t worth it. I’d also question: if it takes more than a minute to make the decision, is it really not reducible to a set of smaller decisions?
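The arithmetic behind the “about 500 samples per day” figure is simple, assuming an idealized eight-hour day spent doing nothing but labeling:

```python
# Throughput arithmetic for manual verification, assuming an idealized
# 8-hour labeling day at one decision per minute.
hours_per_day = 8
decisions_per_minute = 1
decisions_per_day = hours_per_day * 60 * decisions_per_minute
print(decisions_per_day)  # 480, i.e. "about 500"
```

Halve the time per decision and you double your verification throughput, which is exactly why fast decisions are so much cheaper to validate.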

Say you’re looking to make an AI to help decide which expensive watch to buy. Things that might go through your head when making the decision include:

  • Is it within my budget?
  • Is it the color I want?
  • Is it available in my size?

These are simple problems that an AI assistant could use to automatically discard watches that aren’t worth considering, leaving you to focus on:

  • Is it comfortable?
  • Do I like the style?

Further, the easily automatable pieces of the problem are generalizable. They aren’t just applicable to watches, but to shoes, shirts, and a variety of other clothing items and accessories.
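A minimal sketch of that hard-filter stage might look like the following. The watch data, field names, and thresholds are invented for illustration; the point is that the objective criteria are trivially automatable while the subjective ones are left to the human.

```python
# A toy sketch of the automatable filtering stage described above.
# All field names and data are hypothetical.

def passes_hard_filters(watch, budget, color, wrist_size):
    """Discard watches that fail any objective criterion, leaving
    subjective questions (comfort, style) to the human."""
    return (watch["price"] <= budget
            and watch["color"] == color
            and wrist_size in watch["sizes"])

watches = [
    {"name": "A", "price": 900,  "color": "silver", "sizes": {38, 42}},
    {"name": "B", "price": 2500, "color": "silver", "sizes": {42}},
    {"name": "C", "price": 800,  "color": "gold",   "sizes": {38, 42}},
]

shortlist = [w for w in watches
             if passes_hard_filters(w, budget=1000, color="silver", wrist_size=42)]
print([w["name"] for w in shortlist])  # ['A']
```

Swap `watch` for a shoe or a shirt and only the field names change, which is what makes this piece of the problem generalizable.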

Choose a decision problem where lots of people make this sort of decision.

If many people can make the decision, you can easily check your work by collaborating with them. You’ll know you’ve found one if there is an entire profession with people constantly making this decision.

The decision-makers are your source of requirements (hint: you’re making this algorithm to automate away the more tedious of the decisions they make; you’re building it for them). They’re a great source for labeled data. If labeled data isn’t easy to come by, you can usually

  • outsource data labeling to them,
  • consult them on tricky cases, and
  • literally use their daily work as training data.

Most preferably, as the designer of the algorithm, YOU should know how to make this sort of decision. Learning how to make the decision, or at least building some intuition for it, saves you a lot of time when verifying system performance.

Choose a decision problem where people who make this decision tend to do it often.

This gets at the amount of data that may already exist or be easily generated. If it’s a once-per-year decision, you aren’t likely to have much historical data to base your model on.

If your system is designed to be used in users’ everyday lives, they will see the improvements from your system constantly. Highlighting common potential errors in documents as they’re typed is a prime example: people need help ensuring they typed a word correctly or haven’t made some easily-catchable mistake. Instead, they’re free to focus on what they actually want to say rather than whether it’s really spelled “concured” or “concurred”.

Given the choice, at no cost, I would much rather do away with an almost-mindless near-daily decision than with one I’d have to spend hours contemplating but may only make once in my life. What’s great is that the former usually ends up being simpler to automate as well, so not only is the cost lower, but the reward is greater.

AI Design Principles: Choosing the Right Problem – Part 1

Part 1: Begin with a Simple, Easy Decision Problem


If there’s a big mistake people make when designing machine learning systems, it is deciding to tackle the wrong problem. Pick the wrong problem, and you can easily spend months bashing your head against one that would require a research team years to solve properly. Most companies don’t have that magnitude of resources to devote to solving an individual problem, and it isn’t cost effective anyway.

How do you decide what problem to solve? Begin with a simple, easy decision problem.

What is a “simple, easy decision problem”?

A simple, easy decision problem is one that

  • involves automating making a decision
  • has a well-defined set of mutually-exclusive possible decisions
  • has explainable decisions that reasonable people would agree on
  • is so simple that it cannot be decomposed into smaller problems worth solving
  • has a fast, easy solution

Why?

Choose a decision problem to automate. Decision problems automate thought processes that humans already perform. If you really just want to understand your data, then at this stage what you want is data exploration and analysis (possibly using machine learning). In general you wouldn’t automate a process like k-means clustering – you’d run it with a specific purpose in mind. On the other hand, you would automate a process like “Should I switch the light in the north-south direction at this intersection to red?”

Choose a decision problem with a well-defined set of possible decisions. If the set of possible decisions the machine might make is unbounded, or it isn’t clear what a decision means, you’ll have problems. If there are infinitely many possible decisions (or simply so many that a human couldn’t reasonably consider them all), the simplest algorithm that can produce every possible answer is already very complex. You also lose the ability to train on each possible response, since gathering data on every one may be impractical.

On the other hand, deciding “Which one of these ten genres does this book fall in based on its title and text?” is well-defined. As a person making the decision, you list out the possible genres and pick the one the book best falls in. The process for the machine is analogous – it may calculate scores for each genre and pick the top one.
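The “score each genre, pick the top one” process can be sketched directly. The keyword-overlap scoring below is a toy stand-in for a real classifier, and the genre lists are invented, but the argmax structure is the point:

```python
# Toy sketch of "score every genre, pick the best one". The keyword
# tables are invented; a real system would use a trained classifier.

GENRE_KEYWORDS = {
    "science fiction": {"starship", "robot", "galaxy"},
    "fantasy": {"dragon", "wizard", "kingdom"},
    "mystery": {"detective", "murder", "clue"},
}

def classify_genre(text):
    """Score each genre by keyword overlap, then take the argmax."""
    words = set(text.lower().split())
    scores = {genre: len(words & kw) for genre, kw in GENRE_KEYWORDS.items()}
    return max(scores, key=scores.get)

print(classify_genre("a detective follows a clue to solve the murder"))
# mystery
```

Because the decision set is small, well-defined, and mutually exclusive, evaluating the system is as easy as comparing its single answer against a human’s single answer.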

Choose a decision problem with a set of mutually-exclusive decisions. If there are multiple correct decisions for the same set of inputs, you lose the ability to specifically train the model. If any combination of answers is permissible and you have more than, say, 20 of them, you really have an answer space with over a million (2²⁰) possible choices. It also complicates training the model. Say you’re training a chat bot to answer natural language questions users pose. If the chat bot gives three of five essential pieces of information when responding to a particular question, “how correct” was it, and how should the interaction be counted in training or evaluation?

Choose a decision problem where reasonable people would make the same decisions. Suppose you’re building a system to pick a wall color palette for a room based on the furniture the owner already has. If you ask five interior designers you’ll get five reasonable, but different, palettes. Do you use all of these as positive training examples? How do you evaluate whether it made the right decision on a new room and set of furniture?

Choose a decision problem that is so simple that it cannot be decomposed into simpler decision problems. Let’s say you’re building an app that automatically generates a grocery list for users based on what they have in their refrigerator. Considering every possible shopping list simultaneously would be a nightmare. Instead, we can decompose this into one problem for each food item, and a larger problem that merges these decisions into a single grocery list.

The model for deciding whether to include each item on the list might be based on:

  1. Does the user currently have any of this item (or similar) in their refrigerator? Is it expired?
  2. How much does the user usually consume per day? Week? Month?
  3. Is the user’s past consumption of this item regular or spurious?
  4. Does the user already have items that can be used in many recipes with this item?
  5. Is this item easily available to the user?
  6. Has the user indicated they are allergic / don’t like this item?

The model that merges these sub-solutions might be based on:

  1. How wide a range of recipes does this list, plus their at-home stock, allow? (Also, prioritize recipes the user has favorited, or similar ones.)
  2. Does the user have enough available space to store everything on this list?
  3. What items can be eliminated that cause the smallest decrease in potential recipe variance?

At the point where we’re combining the outputs of many decision models we’ve technically diverged from decision problems, but I think the idea of merging solutions this way is powerful. The point is the smaller per-ingredient decision problems are individually good starting points.
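The decomposition above can be sketched as one tiny decision function per item plus a merge step. Everything here – the field names, the thresholds, the ranking rule – is invented to show the shape of the decomposition, not a real recommender:

```python
# Hypothetical sketch of the per-item decision + merge decomposition.
# Field names, thresholds, and data are invented for illustration.

def should_consider(item):
    """Per-item decision: does this item belong on the candidate list?"""
    return (not item["allergic"]
            and item["days_of_supply_left"] < 3
            and item["avg_weekly_consumption"] > 0)

def merge_into_list(items, max_items):
    """Merge step: keep the most-consumed candidates that fit the budget."""
    candidates = [i for i in items if should_consider(i)]
    candidates.sort(key=lambda i: i["avg_weekly_consumption"], reverse=True)
    return [i["name"] for i in candidates[:max_items]]

pantry = [
    {"name": "milk",    "allergic": False, "days_of_supply_left": 1,  "avg_weekly_consumption": 4},
    {"name": "peanuts", "allergic": True,  "days_of_supply_left": 0,  "avg_weekly_consumption": 2},
    {"name": "rice",    "allergic": False, "days_of_supply_left": 10, "avg_weekly_consumption": 3},
]

print(merge_into_list(pantry, max_items=5))  # ['milk']
```

Each `should_consider` call is an independently testable yes/no decision, which is exactly what makes the per-item problems good starting points.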

Choose a decision problem that has a fast, easy solution. Put greedily: the longer it takes to build a working prototype or implementation, the longer before you see returns on the work you put in. More importantly, the sooner you see how the model improved someone’s workflow or life, the faster you’ll be able to iterate on it to make it even better. You’ll get quick feedback on a simple solution, so you’ll be able to grab more low-hanging fruit if you want to improve the model. And if the solution is good enough for now, it frees you to go work on another problem!

Planning and Entropic Forces

The Foundation series is a classic science fiction series by Isaac Asimov. In it Hari Seldon, a scientist of the fictional field of psychohistory, predicts the collapse of the galactic empire by modeling the future of humanity with entropic forces. Seldon devises a plan to ensure the best future for humanity after the collapse: he forecasts a thousand years of the future, then sets a series of actions in motion to ensure humanity is able to go down one highly unlikely specific path and end up at the desired future.

I think he got it wrong.

By focusing on one path, Seldon limits the potential good futures to a single, very unlikely path. Spoiler: guess how that worked out. Despite understanding how entropic forces made the end of the empire all but inevitable, he never turns this idea to humanity’s advantage. He should have taken actions that maximized the number of potential future paths ending in desirable states.

What does this mean?

The more ways a plan has of succeeding, the more robust it is. Robust plans don’t fail at the first unforeseen challenge. Consider an example from technology – database redundancy. If everything is stored in one database and the database goes down, the entire system stops working. However, if there are three copies of the database (far enough away that failures are independent), then the likelihood of the entire system going down is minuscule. As we increase system redundancy, the entropy of the state where all systems are failing approaches zero, and the entropy of states we want grows without bound.

(In this situation there are practical redundancy limits, and the best practice is usually to run at least two more copies than the number of databases the system needs under the heaviest expected load. It may be interesting to explore the math later …)
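The redundancy intuition checks out with a one-line calculation: with independent failures, the probability that every copy is down at once shrinks geometrically. The per-database failure probability `p = 0.01` is an assumed illustrative value, not a real-world figure.

```python
# With independent failures at probability p per copy, the chance that
# ALL copies fail at once is p ** copies. p = 0.01 is an assumed value.
p = 0.01
for copies in (1, 2, 3):
    print(copies, "copies -> total-outage probability:", p ** copies)
# shrinks from 1-in-100 to roughly 1-in-a-million
```

Going from one copy to three turns a 1-in-100 outage into (approximately) a 1-in-a-million one, which is the entire argument for having more than one path to success.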

How do I use this in planning?

Have more than one path towards what you consider “success”. Suppose your goal is to work in position X at Company Y. Naively, your path forward is to apply for position X. You have some probability of getting the position, and you either get it or you don’t.

There are more paths than these.

What are other paths that accomplish your needs? Do you need the position right now or can it wait a year? Does it need to be that position at that company, or are there similar opportunities at the same (or at similar) companies? Is it acceptable to apply, fail and get feedback, then try again? Is there, perhaps, an intermediary stepping-stone job? A degree that might increase your chances? A meetup group where you might run into employees who could refer you?

The answers to all of these depend on your constraints – your requirements. If you want one specific future then you really only have one option and entropic forces will (statistically) work against you. If more than one future – or more than one path – is acceptable, then you can use them in your favor. In general the more paths, and the more likely you make the success of each path, the higher the entropy of desired futures.

But what does that look like?

I recently switched jobs. I had decided I wasn’t happy on my team and needed a change. There wasn’t anything urgently wrong, so I was fine with waiting up to a year. After thinking about the sorts of changes I wanted, I realized there were three main categories of futures I would consider a success:

  1. finding a team I would be more happy with at my current company,
  2. finding a new company I would be more happy at, or
  3. going back to school for a graduate degree.

I added constraints to each future. For example, there are companies with cultures I would never want to work at. I wanted to stay in the Bay Area, so that limited both schools and companies I could look into.

Then, I broke each general future into a sequence of tasks required to make that future happen and optional tasks that would make it more likely. Required tasks generally fit into two groups: ones that cut off my ability to pursue other paths (e.g. accepting a job offer) and ones that moved the path along while changing (usually reducing) the likelihood of the others (e.g. doing an onsite interview). While the required tasks varied greatly, the optional tasks overlapped heavily across the futures: networking, studying, and understanding my desired team qualities were common to all three.

By doing the tasks that made all of the successful futures more likely, I increased the entropy of futures where I succeeded in my goal. My plans were robust in that there was no single point of failure that could topple them – there were more companies to apply to, many possible teams, and if those didn’t work out in a year then I also had applications for school in the works. Throughout it all, I would be gradually improving the likelihood of each subsequent attempt.

Briefly, these are the steps this approach suggests:

  1. What futures do you consider successful?
  2. What are your constraints on these futures?
  3. What increases the likelihood of each future?
  4. What are common actions that increase the likelihood of multiple futures?
  5. In general do as much towards the common actions as your constraints allow, and do specific actions as necessary.