The trouble with benchmarks

I just got back from the annual Independent Sector conference, which brings together non-profits, foundations, and self-promoting consultants (sorry about that) to discuss the direction of philanthropy.

The theme of the conference echoed that of the online philanthro-sphere: nearly every session and discussion had something to do with data. I had some nice chats with folks about the emerging Markets for Good platform, attended a session on using communal indicators to drive collective impact, and heard one too many pitches about how this or that consulting firm had the evaluation game pretty much locked up.

What many of these conversations had in common was a focus on setting benchmarks to compare progress against. Benchmarking is a quick and dirty tool for trying to estimate an effect over time, but it can be misleading, a fact that was not discussed in any of the sessions I attended.

Benchmarks essentially require one to measure an indicator level at an initial point in time, using that initial measure as a baseline for the future.

For example, a workforce development program might measure the percentage of people in its programs who found work in the last year, using that percentage as a baseline to compare future employment rates against. A year later, the program would look at this year’s employment rate and compare it against last year’s benchmark.

In this simplistic scenario, one might assume that if the employment rate is better this year than last year’s baseline, then the program is doing better, and if this year’s employment rate is below the baseline then it is doing worse. But, as the title of this post gives away, there are some things to consider when using baselines.

As I have written in the past, social sciences are particularly complex because there are so many external factors outside our program interventions that affect the lives of those we aim to serve. In the employment baseline example, a worsening economy is likely to have a larger effect than the employment services themselves, all but assuring that the next year’s employment rate will be below the baseline, even if the program was more effective in its second year.

Under the collective impact benchmarking model, we would collectively flog ourselves for results outside of our control. Likewise, we can also see the opposite effect, whereby we celebrate better outcomes against a previous benchmark when the upward swing is not attributable to our own efforts.
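The mechanics are easy to see in a toy simulation. All of the numbers below are invented for illustration: the program’s true effect doubles in year two while the economy declines, yet the benchmark comparison reads as failure.

```python
import random

random.seed(1)

def employment_rate(base_economy, program_lift, n=5000):
    """Simulate the share of participants employed: an economy-driven
    baseline probability plus the program's true effect."""
    p = base_economy + program_lift
    return sum(random.random() < p for _ in range(n)) / n

# Year one: decent economy, modest program effect (the benchmark year).
benchmark = employment_rate(base_economy=0.50, program_lift=0.05)

# Year two: the program got MORE effective, but the economy worsened.
year_two = employment_rate(base_economy=0.40, program_lift=0.10)

# Year two lands below the benchmark even though the program improved;
# naive benchmarking pins the economy's decline on the program.
```

The benchmark only tells you the rate fell; it cannot tell you which part of the change belongs to the program and which to the economy.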

So, is benchmarking useless? No, but it should not be confused with impact. A mantra I preach to my customers, and brought up in many conversations at the Independent Sector conference, is that it is just as important to understand what your data does not say as what it does.

The simple difference in two outcomes from time A to time B is not necessarily program impact, and cannot necessarily be attributed to our awesomeness.

Benchmarking tells us whether an outcome is higher or lower than it was in the previous period. But subtraction is not really analysis. The analysis lies in teasing out the “why”. Why did an outcome go up or down? Was it the result of something we did, or of external factors? And if the change is attributable to external factors, what are those factors?

In short, benchmarks can help you figure out which questions to ask, but benchmarking itself does not provide many answers.

I’m encouraged by the buzz around using data and analytics, but am cautious that data is kind of like fire, and it’s important that we know what we are doing with it, lest we set ourselves ablaze.

If you had to choose between an evaluator or a marketer

Earlier this week I asked the following question on Twitter

If you had to choose between an evaluator or a marketer for your #nonprofit org, which would you pick and why?

I had plenty of interest in the question, but only one answer. Ann Emery pointed to Innovation Network’s State of Evaluation 2010 report. The salient point Ann pulled from the report, an online survey of over 800 non-profit organizations across the United States, is that “fundraising is #1 priority in nonprofits while evaluation is #9…”.

This statistic provides some support for the open secret that organizations prefer investing in marketing over evaluation, but it doesn’t answer the second part of my original question, why do organizations choose marketing over evaluation?

While not intended as an answer to this question, a nonprofit I have been almost working with for the last year provides some insight. The organization is a relatively young, small nonprofit that has found itself a media darling in certain circles. It pays lip service to a desire to evaluate its programs, but insists that anyone who looks at its numbers be properly vetted (read: already a believer in the agency’s approach).

Evaluators are inquisitive and skeptical by nature. A hypothesis test assumes there is no effect, rejecting that assumption only in the face of convincing evidence. Evaluators do the same thing.
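That null-hypothesis posture can be sketched in a few lines. The example data are hypothetical: 124 of 200 program participants employed versus 110 of 200 in a comparison group.

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in proportions.

    Starts from the null hypothesis that the two rates are equal, and
    reports the p-value: how surprising the observed gap would be if
    there really were no effect."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# 62% employed among participants vs. 55% in a comparison group:
z, p = two_proportion_z(124, 200, 110, 200)
# A 7-point gap on samples this size is not enough to reject "no
# effect" at the conventional 0.05 level -- the skeptic stays skeptical.
```

The point is the default: the gap is presumed to be noise until the evidence says otherwise, which is the opposite of the "impact is a given" mindset described below.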

This organization (not-uniquely) starts from the standpoint that its impact is a given. In that mindset, evaluators can only disprove your asserted greatness. Thinking of it that way, I’m not sure I’d hire an evaluator either.

An investment in marketing, however, brings accolades from the press, photo ops with politicians, and the adoration (and financial support) of the general public. So really, a choice between marketing and evaluation is a choice between fame and fortune on the one hand, and on the other, the possibility of discovering that the project you have invested in for over half a decade doesn’t do what you thought it did.

In this way, choosing the marketing consultant is the only rational choice to make. Well, that is, if your organization’s logic model defines an ultimate goal of self-aggrandizement. If instead your target population is the people your agency aims to serve, and your impact theory defines causal linkages between your interventions and something other than coverage in the New York Times, then an evaluator might be an okay idea after all.

Consumer protections and false advertising

Corporations invest in evaluating their products in part because better products are more competitive in the marketplace. But given the indirect funding nature of nonprofits, where the service recipient is not the purchaser of services, this incentive falls apart.

However, corporations also evaluate their products so as not to run afoul of the various consumer protection regulations placed on businesses, including laws against falsely advertising a product’s effects.

Imagine how different the social sector would look if a similar standard were applied to it. I have written before that evaluation brings truth in advertising to the social sector, but the real benefit would be to those we serve. The media story should be a secondary by-product of a job well done. Instead, getting a good media story is the job, period.

Reporting benefits and harms

When I was in graduate school I had a fellowship that placed me with a community development financial intermediary. The organization, like most agencies in the social sector, was interested in demonstrating the effectiveness of its interventions.

I asked the executive director whether she wanted me to try to figure out what impact their work was having, or if she simply wanted me to report positive outcomes. Depending on how you look at a spreadsheet, you can make any gloomy picture look like rock-star results. To her credit, the executive director asked that I try to develop a real picture, not simply a rosy one.

But most of the pictures painted in the “Impact” sections of organizations’ websites are of the Photoshop variety. There is a lot that is wrong with the way outcomes are reported, and conflated with impact, in social sector communications. One problem I consistently see is reporting positive outcomes while neglecting to report on those who experienced worse ones.

For example, the website of a celebrated family focused startup non-profit boasts that 20% of children in their program increased school attendance. Sounds great. But what happened to the other 80%? Did their attendance stay the same, did it get worse? And if so, by how much?

Increases always sound nice, but does any increase always outweigh a decrease? If 20% of students improved school attendance and 80% attended school less would we still consider the program a success?

Well, we probably need some more information to answer this question, information which is never provided in this type of outcomes advertising. We would at the very least need to know what an increase or decrease in attendance means. A 20% increase in students attending classes sounds great, but if a kid was missing 30 days of school a year and now she misses 29, is the gain really as meaningful as it first sounded?
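With invented numbers, the arithmetic shows how lopsided the advertised headline can be. Everything below is hypothetical; the real program disclosed none of these figures.

```python
# Hypothetical cohort of 100 students behind a "20% improved" headline.
improvers, decliners = 20, 80
gain_per_improver = 1    # days gained per improver (30 absences -> 29)
loss_per_decliner = 2    # days lost by each student who attended less

net_change = improvers * gain_per_improver - decliners * loss_per_decliner
# 20 days gained against 160 lost: the advertised increase can mask a
# net loss of 140 attendance days for the cohort as a whole.
```

The headline statistic is true and the program could still have made things worse on net, which is exactly why the other 80% matter.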

More importantly, what would have happened to these kids without the program? We need an estimate of their counterfactual (what the attendance of these youth would have been without this program’s intervention) in order to truly determine whether we think this increase is reason to celebrate (or the possible decrease for a portion of the other 80% is cause for alarm).

Ultimately this comes down to a question of what I call reporting believability. Most of the impact results I have seen on organizations’ websites are simply not believable, as they tend to claim outrageous and unsubstantiated gains.

But these unsubstantiated claims of impact are big business. And if the social sector wants to truly move toward evidence-based programming, we need to figure out how to make it more profitable for organizations to report credible data instead of fantastical folly.

Too many indicators means a whole lot of nothing

Organizations have a tendency to want to collect every data point under the sun. I cannot tell you how many agencies I have contracted with that aim to collect things like social security numbers and criminal histories when these data points carry no decision relevance, and don’t factor anywhere into the services they offer.

Even if organization executives are not concerned with putting those they serve through exhaustive questionnaires, they should be concerned about how overburdening front-line staff with administering lengthy intakes decreases data integrity. I have long advised my customers to keep their survey instruments short and to the point. The shorter your intake, the more likely you are to have every question answered. And if you are only asking a limited number of questions, every question should have been well thought out and be clearly relevant to decision making.

I’m in the process of working with some organizations to redesign their intake forms. One organization I’m working with was attempting to track over 300 indicators. Back in the original intake design phase the thinking was (as is common) that the more data you have the better. In hindsight, my customer realized that trying to collect so many indicators overlooked the implementation reality; it’s a lot easier to say what you want than to go out and get it.

The following histogram shows the number of questions on the y-axis by the number of times those questions were answered on the x-axis over a year for this particular organization. Half of the questions were answered about ten times, and one-third of the questions were never answered at all.

To be clear, this is not a case where the front-line staff was not collecting any data at all. There were a handful of questions with around 3,000 answers, and a reasonable number with between 500 and 1,500 answers. The questions with the most answers were indicators that every front-line staffer found important, such as race and sex. The reason answer counts vary so greatly is that, with so many questions to answer, no staffer was going to answer them all. Each staff person instead used her or his own judgment as to which questions were important to answer.
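The dynamic is easy to reproduce with made-up records: each staffer answers only the questions they personally value, so a handful of questions accumulate counts while the long tail goes empty.

```python
from collections import Counter

# Hypothetical intake records: staff fill in the questions they find
# important and skip the rest of an overlong form.
records = [
    {"race": "white",  "sex": "F", "q_17": "yes"},
    {"race": "black",  "sex": "M"},
    {"race": "latino", "sex": "F", "q_42": "no"},
]

answer_counts = Counter()
for record in records:
    for question, answer in record.items():
        if answer is not None:   # skip explicit non-answers
            answer_counts[question] += 1

# Universally valued questions (race, sex) are answered every time,
# while idiosyncratic questions barely register -- the histogram's
# long-tailed shape in miniature.
```

Tallying answers per question like this, against the full intake form, is the quickest way to spot which questions are actually earning their place.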

With so many holes in this data set, it’s hard to draw much insight. To avoid running into this problem, organizations should first tie each question directly to an outcome in their impact theories. This discipline helps prevent “question-creep”, where new questions are asked out of curiosity rather than for the actions that can be taken with the answers. Second, get front-line staff involved in the intake design process to ensure that all the data they need is being collected and that the questions, as worded, are practical and collectable.

Please stop developing websites that list nonprofits

Yet another website that catalogues non-profits was released into the wild earlier this week, as the Laura and John Arnold Foundation launched the Giving Library. For a sector that disdains duplicating efforts, maintaining online directories of non-profit organizations is a fairly crowded market. What all of these efforts to create proprietary listings of non-profit organizations have in common is an imminent threat of extinction by Google, which has a pretty serious competitive advantage in the indexing game.

Developing listings of nonprofits is not necessary with modern search tools. While it is fairly trivial to find an organization that wants to take a donor’s money, it is far more difficult to identify the organizations a donor believes will maximize their charitable dollars. To be fair, organizations like Great Nonprofits and Charity Navigator are attempting to solve this bigger problem of helping donors identify effective agencies to invest in, although neither provides compelling enough analysis to functionally move outside the sphere of simple cataloguing (yet).

Indeed, the growth in influence of GiveWell underscores the value of analysis over indexing. The problem of course is that the deeper analytic approach is research intensive, and does not scale. Therefore, we are instead bombarded with superficial efforts to simply create nonprofit listings or develop laughably linear four star rating systems.

And where is the evidence that donors need a website dedicated to listing non-profit organizations? Results of the 2012 Millennial Impact Report suggest that, at least among web-savvy donors between the ages of 20-35, people are perfectly capable of learning about non-profits through organizations’ websites, newsletters, and social media channels without the assistance of intermediaries.

Which brings us back to the analysis problem. Donors do not need help finding organizations; they need help selecting organizations based on their evaluative criteria. GiveWell simplifies the process for a certain set of donors by articulating its own criteria and providing investment advice to donors who are inclined to adopt GiveWell’s utility framework.

The more difficult issue then is to develop ways of matching donors to effective organizations that address issues consistent with the donor’s own values. This is a matter of substantive impact evaluation and donor utility elicitation, neither of which has anything to do with hiring a web design firm to throw up yet another nonprofit digital dumpster.


How to select measurable outcomes

At the end of last month I wrote a post advising organizations to set up their evaluation frameworks prior to advertising their greatness to the world, as credible evidence to back up claims of effectiveness is infinitely more persuasive than the alternative. In the post I gave an example of a consultant who helped a former customer of mine create a logic model with poorly defined outcomes, positioning my customer for advertisement rather than measurement. From the post:

One such consultant, who I was later hired to replace to clean up his mess, outlined a data collection strategy that included an outcome of “a healthy, vibrant community free of crime where all people are treated with respect and dignity.”

What an operationalization nightmare. How the heck do you measure that? You don’t. And that’s the point. The logic model was not a functional document used for optimizing results. Instead, it was an advertising document to lull unsavvy donors into believing an organization is effective in the absence of evidence.

Fair criticism, I think (obviously). Well, yesterday Jennifer Banks-Doll asked an even fairer question:

David, I really enjoy your blog and love the example you have given above about the unmeasurable outcome.  So true and so common!  But I feel like you’ve left us hanging…What kind of outcomes would you recommend instead?  Can you give us an example or two?

This is an excellent question, and while I did my best to give an example of the right type of thought process an organization should go through to operationalize outcomes, unfortunately the best answer I can really give is the unsatisfying “it depends”.

While there are those who tend to think through their logic models from left to right, that is, if I do this I expect that, I tend to think the other way around, starting with the goal an organization aims to achieve and moving backwards. There is nothing wrong with beginning the ideation process and goal setting with seemingly unmeasurable ideals. However, what begins as pie-in-the-sky should not be left there.

In my response to Jennifer’s question, I gave the example of an organization trying to create a “safe” neighborhood. Safety is an inherently abstract notion. Part of safety might be actual incidences of crime, but safety might also have to do with perceptions as well. If someone feels unsafe in an area free of crime, is that area “safe”? Well, it depends on how an organization defines its goals.

In the process of creating more exacting definitions of the change you want to see in your target population, you get closer to identifying measurable indicators. The goal is not to create one measure for “safety”, or whatever your outcome of interest is. Instead, we want to come up with a few measures that collectively approximate our otherwise abstract outcome.

One other note on this point. People tend to think that “measurable” necessarily means inherently enumerable. Working again with our safety example, one measure of safety might be crime rates, but another measure might be an opinion survey of residents in an area, asking them if they feel safe. If I had to choose one, my preference would be for the perception survey over the crime data.

Crime statistics, while numeric by nature, are not necessarily pure measures of “safety”. If arrests increase in a neighborhood, this might be due to an increase in criminal activity, but it also might mean there are more police patrolling the area. Indeed, in some of the community development work I have done, I have found (in some cases) that perceptions of safety actually increase with the number of reported crimes.

The point I am trying to illustrate is that organizations should not feel pressure to use public data or seemingly more quantitative metrics if those are not good measures of their intended outcomes. The best indicators are those which most closely approximate measures of the change you want to see in the world. If the best way to get that data is asking people how they feel about a particular issue, then by all means, hit the streets.

Snake oil nonprofit consultants sell outcomes as impact

Google’s advertising algorithm knows me too well. Pretty much the only advertisements I see now are for non-profit services. I tend to click through to these advertisements as a way of checking the pulse of social sector offerings outside the typical circles I operate in.

Yesterday I clicked on this advertisement for Apricot non-profit software by CTK. The advertisement includes a product demonstration video for their outcomes management system in which the narrator conflates outcomes with impact ad nauseam.


Outcomes is not another word for impact. An outcome refers to a target population’s condition; impact is the change in that condition attributable to an intervention.

While we all tend to use the same buzzwords (“manage to outcomes”, “collective impact”, etc.), we lack uniform agreement on what these terms mean. In the case of outcomes and impact, these are terms that come from the evaluation literature, and they are (hopefully) not open to the manipulation of social sector consultancies with more depth in marketing than social science.

There are some who believe helping an organization at least understand its outcomes is a step in the right direction. I count myself as one of them. But telling an organization it can infer something about impact and causality by simply looking at a distribution of outcomes is not only irresponsible, it is downright dishonest.

The promise of metrics is to help the social sector move toward truer insight, not to use data to mislead funders. Whether the persistently misleading misuse of outcomes metrics is intentional or the result of ignorance, it has no place in our work, and it only stands to derail the opportunity we all have to raise the standards of evidence in our sector.

Funder mandated outcomes requirements create perverse incentives for implementing organizations

On its face, making grants contingent on implementing organizations meeting outcomes objectives seems sensible. After all, the sector is moving toward an era of smarter philanthropy and impact-oriented giving, so shouldn’t funders demand outcomes for their money?

Kind of.

I have been aware of this problem for some time, but was recently faced with an ethical dilemma when a customer asked me to help them adjust their program offerings to better achieve funder required outcomes.

The organization I was working with provides employment services to low-income individuals. Their grant stipulated that a certain number of program participants had to get placed in jobs in a given period of time. At first glance this seems to be a simple optimization problem where the employment program wants to maximize the number of people placed into employment.

Given this simplistic directive (place as many people as possible into employment), the optimization is actually quite trivial; serve the people most likely to get employment and ignore the hard to serve.

In this case, the hard to serve also tend to be those who need employment assistance the most. People with children might be harder to place into employment, given childcare needs, yet these individuals arguably need employment more than the otherwise equivalent person without dependent children.

Similarly, better educated people are easier to place into jobs than those with less education. But better educated people are more likely to find employment irrespective of the program intervention. This, of course, highlights the difference between outcomes and impact. The grant required improvements in outcomes, that is, the number of people placed into employment. When the focus is on outcomes, simple rationality dictates finding the most well-off persons who still qualify for services, and serving those folks first at the detriment of those who are harder to serve but in greater need.
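A toy example makes the perverse incentive concrete. The applicants, probabilities, and need scores below are all invented: ranking by likely placement and ranking by need select opposite ends of the same pool.

```python
# Hypothetical applicant pool: (description, est. placement probability,
# need score -- higher means the person needs the service more).
applicants = [
    ("college grad, no dependents", 0.90, 1),
    ("high school grad",            0.60, 2),
    ("single parent, no diploma",   0.35, 3),
    ("long-term unemployed",        0.20, 3),
]

slots = 2  # the program can only serve two people

# Optimizing the funder's outcome target (raw placements) means serving
# the people most likely to find work anyway...
by_outcome = sorted(applicants, key=lambda a: a[1], reverse=True)[:slots]

# ...while prioritizing need selects the opposite end of the pool.
by_need = sorted(applicants, key=lambda a: a[2], reverse=True)[:slots]
```

Under the placement-count objective, the hardest-to-serve applicants are never rational to admit, which is precisely the dilemma described above.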

While serving those who are better off first is rational given the way outcomes thresholds are typically written in grants, it probably does not lead to the desired outcome of the implementing organization (nor the intended outcome of the funder either!).

Hence my firm’s ethical dilemma. In asking for our assistance optimizing outcomes against its funder’s guidelines, our client was unwittingly asking us to help it identify and serve only those who didn’t need its help that much to begin with. Our mission is to help organizations use metrics to help people better, and in this case, a data-oriented approach to a misguided objective would likely lead to under-serving a hurting demographic.

Of course, this mess is nothing new, and is outside the control of implementing organizations. Funders requiring meaningless metrics of implementing organizations is not news. However, as funders try to press their grantees for more results, I am concerned that funders with a poor grasp of the difference between outcomes and impact, and insufficient knowledge to properly operationalize social indicators, will force implementing agencies to act in financially rational ways that end up hurting their target populations.

The answer to this problem is better data literacy in the social sector. My practice to date has focused on the data literacy of implementing agencies, but I’m worried that the zeal for more proof of social impact has exposed an open secret: both our front-line and grant-making institutions have a limited capacity to use data effectively.

And to those who say that some data is better than no data, I would argue that data is more like fire than we tend to realize. Fire has done incredible things for humanity, but those who do not know how to use it are likely to burn themselves.

Operationalize, optimize, then advertise your outcomes

In the for-profit sector, profits are the bottom line. Yet companies spend considerable amounts of money trying to figure out whether their products work as advertised, whether their customers are happy, and what they can do to improve quality and customer satisfaction. While advertising drives sales of an effective product or service, smart organizations are careful to first figure out how to measure the effectiveness of their offerings with their target consumers (operationalization), optimize based on customer feedback, and then advertise.

In the social sector, we make up an intervention and then skip straight to advertising the hell out of it. No measurement plan, no operationalization of desired outcomes, and certainly no optimization (an impossibility if we aren’t measuring our effectiveness to begin with).

While we are loath to operationalize, optimize, or do anything that rhymes with “evaluation”, we love it when it rains advertisers. Our advertisers come in the form of grant writers, social-media-for-social-good consultants, and pretty much anyone willing to work on a retainer to tell an organization it’s fantastic.

But how fantastic can we be without sensible data collection strategies? And how much can we improve if we continue to offer the same intervention year after year without improvements? The assumption is that the continued existence of an intervention is sufficient proof that a service is working and is valuable. Of course, this is the simple fallacy that always leads to the poor getting screwed, especially when their choice is between bad services and nothing.

The emphasis on advertising stems from agencies’ survival instincts. Indeed, the primary function of any organization is to continue to exist. I get that. But the irony is that funders and donors are begging for any organization to step up with reliable metrics and believable outcomes.

As funders and donors have started to demand more evidence of impact from organizations, the usual suspects of advertising consultants have shifted their rhetoric (but not offerings) to appear more in line with the shifting philanthropic landscape. All of a sudden, non-profit marketing consultants with backgrounds in Russian literature and interpretive dance are qualified to help organizations craft logic models and develop rigorous data collection strategies.

One such consultant, who I was later hired to replace to clean up his mess, outlined a data collection strategy that included an outcome of “a healthy, vibrant community free of crime where all people are treated with respect and dignity.”


What an operationalization nightmare. How the heck do you measure that? You don’t. And that’s the point. The logic model was not a functional document used for optimizing results. Instead, it was an advertising document to lull unsavvy donors into believing an organization is effective in the absence of evidence.

The good news is that donors and funders are starting to get wise to the backward thinking “advertise first” mentality. The social sector is shifting, for the better, to reward organizations that take their data collection plans seriously, and who look to improve on their impact rather than simply advertise it to anyone willing to listen.

Organizations hoping to enjoy fundraising success in the future would be wise to invert their funding strategy to a model that emphasizes operationalization and optimization of outcomes first. In this new era of philanthropy, without evidence of impact, your advertising partners won’t have anything to sell.

Service rationing and social welfare maximization

It’s fairly typical for direct service organizations to lament that demand for services outpaces supply. In the free market, when demand exceeds supply, sellers increase prices, as the seller’s focus is maximizing profits.

In the social sector, our focus is maximizing social benefit, yet outside of the medical world (where triage dictates that the worst-off individuals capable of survival get treated first) there is no comparable logic governing service rationing.

Through my work, I visited two programs recently: one that supplies affordable housing vouchers for homeless individuals, and another that places low-income youth in summer jobs with stipends. Both organizations face greater demand for services than they can afford to meet, meaning they each have to turn a significant number of people away.

I asked both program directors how they decide who receives services, and in both cases they simply service people on a first-come, first-served basis.

The problem with first-come, first-served is that those who show up first might not be the best fit to maximize social welfare. Returning to the for-profit example, if you have ten concert tickets to sell, and the first five people in line are willing to pay a dollar while the next twenty are willing to pay thirty dollars each, you would turn away the first five and sell to ten of the people behind them.

But in the social sector, when we select people on a first-come, first-served basis, we risk serving those who need the service less, or are less likely to succeed, than others we could have selected.

The difficulty is that we think about service rationing in terms of eligibility instead of social welfare maximization. If someone qualifies for our program, they are in, even if other indicators suggest that the selected individual might drop out of the program soon after enrollment, or that her or his need for the service is far more modest than the person behind them in line.

In order to maximize social welfare, we have to define what social welfare means to us. This definition stems from the outcomes identified in your logic model’s impact theory. In the case of the affordable housing program, the desired outcome might be to minimize the number of years of life lost due to homelessness. Under this framework, we would prefer serving individuals who are at greater risk of lower life expectancy due to their homelessness, rather than simply taking every person who is homeless until vouchers run out.

The distinction I’m making here is between eligibility (being homeless makes you eligible) and our social welfare maximization framework. The framework gives us a way of prioritizing service delivery between two individuals who both qualify for services.

In the case of the youth workforce development program, while all low-income youth would qualify for services, we might have a preference for placing people into the program who are likely to complete the internship. In this case, one could use historical data to fit a predictive model that provides some insight into what characteristics made an individual more or less likely to have completed the program in the past. Under this framework, social welfare maximization would involve not only placing people into the program, but maximizing the number of people in the program who complete the internship.
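As a sketch of the idea, with fabricated historical records: even a crude group-rate model illustrates scoring by past completion, though a real implementation would use a proper classifier on many more characteristics.

```python
from collections import defaultdict

# Fabricated historical records: (has_diploma, has_prior_job, completed).
history = [
    (True,  True,  True),  (True,  False, True),
    (True,  True,  True),  (True,  False, False),
    (False, True,  True),  (False, True,  False),
    (False, False, False), (False, False, False),
]

# Group past participants by a characteristic and compute completion rates.
by_diploma = defaultdict(list)
for has_diploma, _, completed in history:
    by_diploma[has_diploma].append(completed)

rates = {group: sum(done) / len(done) for group, done in by_diploma.items()}
# New applicants could then be scored by the completion rate of past
# participants who share their characteristics.
```

Note the tension with the previous post: a completion-maximizing score, applied naively, recreates the same incentive to skim the easiest-to-serve, so it should be weighed against need rather than used alone.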

Supply and demand issues have long plagued the social sector, in both economic booms and busts. Therefore, we need to be smarter about how we allocate our scarce resources. The first step in better allocation of resources is a well defined impact theory that clearly identifies an organization’s intended goals. From there, one can develop a utility maximizing framework that learns from historical data to better optimize allocations through time.