

The Future of Marketing and Advertising Belongs to Software

Since day 1 I've described what we are building as "a web technology for marketing and advertising - not an advertising and marketing technology for the web." Of course it's a play on words, but the purpose is to define our product more clearly. It is software. As we begin to open source some of the tools we have created, we are reminded every day that Yieldbot is a software company. That’s a good thing.

I ventured into display advertising because it had a weak technology stack supporting a pre-digital business model. The de-facto intelligence in display advertising is a 1-10 ranking system - the ad server waterfall - with a single unit of measurement - the impression - that was in fact always different. From a software perspective, display advertising is a massive opportunity.

My web software experience was first in e-commerce, where I watched amazing software be created by the likes of ATG and Endeca. Then in Search, where I watched Google and Yahoo employ thousands-strong armies of engineers. Most recently at Offermatica/Adobe Test&Target, where the software serves billions of highly optimized and dynamic web experiences every week. Software. Software. Software.

Michael Walrath, Founder of RightMedia said recently:

“In order to build a truly disruptive and highly valuable company delivering enterprise software for digital advertising, the new solution has to be an order of magnitude better than the existing systems.  It is not enough to deliver an incrementally better version of the existing systems.  If there is to be a resurgent disruptor in the advertising technology space it has to change the game. It must attack the white space…”

What I love about this quote is that it frames the market opportunity as enterprise software and software that must do something where nothing has been done before. 

Yieldbot attacks this challenge every day. Massive batch and realtime and predictive analytics. Machine-learning and automated intelligence. Differentiated and highly dynamic units of measurement. The visualization of data and the ability to make it actionable. A white space where the focus is not on buying or selling media - but on how well media and people can be matched in realtime.

Matching differently. That is our disruption.

The enterprise software I admired and mentioned above all looked to solve the matching problem. Display advertising’s main problem, as I wrote 2 years ago, is the only place where the “order of magnitude better than existing systems” can be achieved. This is because new, more intelligent methods of matching can fundamentally revalue the media around something besides impressions and cookies. We believe that something is realtime visit intent.

It’s an amazing time to build software. There is more technology to get more understanding and create more intelligence at a lower cost than ever before. The advances in analytics, databases and languages create an order of magnitude more power. I couldn’t think of anything more exciting to be working on in this day and age than software, or a better group of people to be doing it with. The future of marketing and advertising belongs to software.

 

Introducing Pascalog

[Image: shared under Creative Commons Attribution-ShareAlike license, Flickr user Timitrius]

Today, the dev team at Yieldbot is excited to announce plans to open source one of our prized internally developed technologies: Pascalog.

Technology often evolves more in cycles than linearly, with past patterns showing through as more recent innovations are made.

For a while we were doing all of our analytics in Cascalog, and things were going great. As a Clojure DSL written on top of the Hadoop Cascading API, Cascalog is a brilliant technology for efficiently processing large data sets with very tersely written code.

In fact, we even wrote about those experiences here and here.

But we found ourselves writing things like this:

(<- [!pub !country !region !city !kw !ref !url ?s]
    (rv-sq !pub !country !region !city !kw !ref !url ?pv-id ?c)
    (c/sum ?c :> ?s))

We thought that there had to be a better way. When we realized that Clojure, being a Lisp, has its foundations in the 1960's, we immediately saw that the next logical step would be an upgrade into the 1970's.

Wouldn't we want to write something more like:

program HelloWorld;
begin
   writeln('Hello, World!');
end.

And we immediately set upon bringing the best of software development of the 1970's, Pascal, into the Big Data world of the 2010's. Pascalog was born. (Who couldn't love a language that wants you to end your programs with a "."?)

This also fit well with internal discussions we were having at the time lamenting the complexity of managing a Hadoop cluster and the efficiencies that might be gained by combining all the functionality back into one processing environment on a mainframe. That dream is on hold until we find a suitable hardware vendor, but there was certainly no reason to hold Pascalog development back for that.

Data is a readln() Away

In Pascalog we've done the heavy lifting. By adapting readln() to be bound to a Cascading Tap, you read data in the way you've done since your Turbo Pascal days.

It didn't take us long to realize that you'd want to save the results of your calculations somewhere, so in a follow-on version we added the mapping of writeln() to an output Cascading Tap.

Configuring your input and output taps and mapping them to readln() and writeln() is as easy as configuring an INI file.

An upcoming version which should be available shortly will also allow the readln() of one Pascalog program to be mapped to the writeln() of an upstream Pascalog program, allowing you to daisychain your Pascalog programs.

Why Pascal?

We make it sound above like we jumped onto the Pascal bandwagon right away, but in truth we considered several alternatives from the 1970's.

Of particular interest was the ability to write nested procedures. We've grown accustomed to this from our Python development on other parts of the platform, and this allows us to migrate between the two worlds more seamlessly (compared to, say, Fortran).

The availability of a goto statement is also a great feature to bail you out if you start getting a little too lost in your control flow. This has become a lost art.

We did consider C, but couldn't get over the hump of having it named "Clog".

The Future

We're furiously looking for a Pascal Meetup group where we can make a live presentation. If you know of one, please let us know!

We have a long list of features in mind to build, but we also want to hear back from the community.

Visit www.pascalog.org to get started! We're looking forward to the pull requests. If you have live questions there's usually one of us hanging out on CompuServe under user ID [73217, 55].

 

Development as Ops Training


It's become fairly well understood that "Dev" and "Ops" are no longer separate skill sets and are combined into a role called "DevOps". This role has become one of the hottest and hardest to fill.

At Yieldbot we've taken a pretty hardcore approach to putting together Dev and Ops into DevOps that serves us well and should be a great repeatable pattern.

Chef + AWS Consolidated Billing

The underlying philosophy we have is that the development environment should match as closely as possible the production environment. When you're building an analytics and ad serving product with a worldwide distributed footprint that can be a challenge.

Our first building block is the use of Chef (and on top of that ClusterChef, which is now Ironfan). Using these tools we've fully defined each role of the servers in a given region (by defining as a cluster), and all of the services that they run. We coordinate deploys through our Chef server with knife commands, and Chef controls everything from the OS packages that get installed, to the configuration of application settings, to the configuration of DNS names, etc.

The second building block is that every developer at Yieldbot gets their own AWS account as a sandbox. We use the AWS "Consolidated Billing" feature to bring the billing all under our production account. This lets us see a breakdown of everybody's charges and means we get one single bill to pay.

The last detail is that every developer uses a unique suffix that is used to make resource references unique when global uniqueness is necessary. This is mostly used for resolving S3 bucket names. For any S3 bucket we have in production such as "foo.bar", the developer will have an equivalent bucket named "foo.bar.<developer>".
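
A minimal sketch of that naming convention in Python. The helper name `dev_bucket` and the suffix `alice` are made up for illustration and are not part of the platform described here.

```python
def dev_bucket(production_bucket: str, developer: str = "") -> str:
    """Return the bucket name to use in a given developer sandbox.

    An empty developer suffix means production, so the name is unchanged.
    """
    if not developer:
        return production_bucket
    return f"{production_bucket}.{developer}"


print(dev_bucket("foo.bar"))           # foo.bar
print(dev_bucket("foo.bar", "alice"))  # foo.bar.alice
```

Keeping the suffix as the last component means any code that builds a bucket name only needs to know one extra setting per environment.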

Doing Two Things at Once

With all of that as the status quo, developers are almost always doing two things: developing/testing (the Dev), and learning/practicing how the platform is managed in production (the Ops).

Everyone has their own Chef server, which is interacted with the same way that the production Chef server is. As they deploy the code they are working on into their own working environment, they're learning/doing exactly what they would do in production.

All of this was put in place over the last year while the development team was static, during which time we switched from Puppet to Chef.

But the power of this approach really hit home recently as we've started to add more people to the team. The first thing a new hire does is go through our process of getting their development environment set up. There are still bumps along the way, and new hires both hit those problems and take part in ironing them out. The great thing about this approach, though, is that each bump is a lesson about how the production environment works and a lesson in problem solving in that environment.

The Differences

Having said all that, there are a couple of differences that we've consciously put in place between development and production, with the driving force being cost.

The instances are generally sized smaller, since the scale needed for production is much greater. Amazon's recent addition of 64-bit support on the m1.small was a great help.

We use several databases (a mix of MongoDB, Redis, and an internally developed DB tech) that are distributed on different machines in production that we collapse together onto a single instance with a special role called "devdb".

More

We'll have to have some future blog posts about how we import subsets of production data into development for testing, and the like.

We also use Chef with ClusterChef/Ironfan for managing the lifecycle of our dynamic Hadoop clusters. Yet another good topic for a post all its own.

Have experience with a similar approach or ideas about how to make it even better? We want to hear about it.

 

 

Realtime Kills Everything

Our first ad campaigns are live and the results are exciting. The campaign ran on a premium publisher in the women’s lifestyle vertical and beat the publisher’s control group on Click-Through Rate (CTR) by 77% on the 728 x 90 unit and 194% on the 300 x 250. There were over 1M impressions in the campaign served on this domain over a 2-week period. Yieldbot is now serving the entire campaign.

Most exciting to us are some of the individual results:

  • The best performing keyword has a CTR of 1.56%. 
  • The best creative unit (a 300 x 250) is getting 1.01%

We are running IAB standard banner units. This is not text. This is not rich media.

According to MediaMind the industry average CTR for the campaign vertical is 0.07%.

The most matched keyword intent has a CTR of 0.43%. It also has a CPC of $5.

That math works out to an eCPM of $21.44. That’s pretty exciting stuff. Even more so when you factor in that this campaign is running in what was unsold inventory.
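
For readers who want to check the arithmetic, eCPM is CTR times CPC times 1,000 impressions. The sketch below uses a hypothetical helper named `ecpm` and the rounded figures quoted above. With those rounded inputs it gives about $21.50, which suggests the $21.44 above was computed from an unrounded CTR of roughly 0.429%.

```python
def ecpm(ctr: float, cpc: float) -> float:
    """Effective revenue per 1,000 impressions, given CTR as a fraction and CPC in dollars."""
    return ctr * cpc * 1000


# Rounded figures quoted in the post: CTR of 0.43% and CPC of $5.
print(f"${ecpm(0.0043, 5.00):.2f}")  # $21.50
```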

When I shared the results with one of our Board Members he asked me what, at the time, I thought was a simple question: “Why are the results so good?” But then I actually had to think hard about the answer. I had to boil down a year of beta testing and then another year of building a scalable platform into what deserved to be a simple answer.

Realtime.

Realtime was my one word answer. Never before was every page view of intent for this publisher's visitors captured in realtime - let alone used to make a call to an ad server at that very moment.

Realtime is different. Realtime kills everything before it. As such, Yieldbot is not building ad technology for the web. We are building web technology for ads. Since nothing is more important for advertising success than timing it makes sense that nothing is more valuable for results than realtime.

Realtime was a big buzzword for a while but the hype has died down. That’s good. In the Hype Cycle we’re now somewhere moving from the “Trough of Disillusionment” to the “Slope of Enlightenment.” It is however this ability of the web to react in realtime that makes the future of the medium so exciting. 

Twitter of course is the best representative example. Twitter changed everything about media that came before it. It used to be that breaking the story was the big deal. Now, even online news seems stodgy compared to people giving realtime updates that planes have landed on rivers or that people are being killed, and opining on a live show right along with it.

As technology continues to get better at processing the trillions of inputs from millions of people going about their daily lives - doing everything from driving their car to work, buying a pack of chips, surfing the web - the web will respond in realtime. Because of that it will be relevant. The idea of an ad campaign will seem like owning a 32-volume set of Encyclopedia Britannica. Everything becomes response because the technology is responsive. Calculations need inputs. The web will be measuring just about everything you do and know the moment you are doing it. Nothing will be sold. Everything will be bought.

It’s that realtime pull that creates these new valuations of the media. That new value of the media is what we have been working to create at Yieldbot. That is why these results are so exciting. Best of all, we’re just getting started. We’ve got a bunch of new campaigns about to get underway and we’re only going to get smarter and more relevant. We’ll continue to keep you posted on how it’s going and if you're running Yieldbot you'll know yourself. In realtime.

 

Relevant News

"The enemies of advertising are the enemies of freedom." - David Ogilvy

Exciting news for Yieldbot and lovers of relevance today as we’re announcing a new Series A round of funding led by New Atlantic Ventures (NAV) and RRE Ventures.  Seed Investors kbs+p Ventures, Common Angels and Neu Venture Capital also participated again in this round.

The funny thing about raising money in media technology is that very few VCs actually understand it and even fewer have vision for where it’s headed. We’re fortunate to bring together a team of investors who live and breathe this stuff and proudly represent New York’s media leadership and Boston’s technology leadership in a way that mirrors Yieldbot’s own corporate footprint.

The funds will be used to continue development and bring to market our Yieldbot for Publishers (YFP) realtime intent-graph™ technology (launched July 2011) and our Yieldbot for Advertisers (YFA) realtime intent marketplace that launched in alpha this month. Together YFP and YFA create a valuable media channel of realtime consumer intent that delivers an order of magnitude more relevant ad matching and performance.

From day one, two years ago, we wanted to bridge the largest digital inventory source, Web Publishers, with the largest and best digital ad spends, Search advertisers, in a way that brings a more relevant web experience to people. We’ve progressed an extremely long way with a small team and relatively little funding so far. Today we’re putting dry powder in our muskets and continuing to battle. The enemies of freedom are only so because they know not relevance.

 

Working at Yieldbot

We're adding more developers to our team and pushing things to the next level. If you like seriously interesting and challenging work in the areas we're looking for help in, you should be talking to us. You'll have a single-digit employee number, so you'll be getting in early and powering us on our way to fulfilling the huge potential we're sitting on.

What can you expect if you decide to jump in and join us on our mission to make the web experience more relevant? For one thing, no shortage of interesting hard problems to solve, and the latest tools to do it with.

A Great Environment

For our development environment we each have an AWS sandbox that deploys the same code as production, so everyday work is production devops training too, with a Mac for your local dev environment. You'll have Campfire group chat up all day, and be in the middle of all the important conversations around what we need to do and how we need to do it, from the CEO on down.

The language and tools you use most during the day will depend on what part of the platform you're focusing on.

A Distributed Realtime Platform

A large part of the core platform is in Python. All of the code around scheduling of tasks and management of the platform is found here, as well as the key ad serving logic and realtime event processing. You'll be making use of MongoDB, Redis, and ElephantDB. You'll be solving problems on running the platform distributed across several data centers worldwide. You'll likely be doing some devops stuff here too, and loving the ease with which Chef lets you get that done.

Bleeding Edge Analytics

If you're working on our analytics then you are loving the use of Cascalog (a Clojure DSL that runs over the Cascading API on Hadoop). The power-to-lines-of-code ratio here is ridiculous. More than that, you'll be writing realtime analytics in Storm. That's not cutting edge, it's definitely bleeding edge.

Focus on UX

To work on the UI you're pushing the limits on the latest JavaScript UI tools like D3.js and Spine.js. Have you thought about how clean client-side MVC should be done? Spine is it. We're serious about quality of UX here. If you're serious about it too, this is where you should be.

An Awesome Team

The team you'll be joining has been there before. We've founded and built successful products, platforms, and companies. We know our industry and what it takes to be successful. And we're doing it.

The most important thing that keeps us developers here at Yieldbot energized is that we're building something people want. That's been clear from the beginning. Our mission to make the web experience more relevant resonates with users, publishers, and advertisers.

If you're up for the challenge contact us at jobs@yieldbot.com. Check out http://www.yieldbot.com/jobs. We have some seriously challenging work you can get started on right away.

 

 

How Yieldbot Defines and Harvests Publisher Intent

The first two questions we usually get asked by publishers are:

1) What do you mean by “intent”?

2) How do you capture it?

So I thought it was time to blog in a little more detail about what we do on the publisher side. 

The following is what we include in our Yieldbot for Publishers User Guide.

Yieldbot for Publishers uses the word “intent” quite a bit in our User Interface. Webster’s dictionary describes intent as a “purpose” and a “state of mind with which an act is done.” Behavioral researchers have also said intent is the answer to “why.” Much like the user queries Search Engines use to understand intent before serving a page, Yieldbot extracts words and phrases to represent the visitor intent of every page view served on your site.

Since Yieldbot’s proxies for visit intent are keywords and phrases the next logical question is how we derive them. 

Is Yieldbot a contextual technology? No. Is Yieldbot a semantic technology? No. Does Yieldbot use third-party intender cookies? Absolutely not!

Yieldbot is built on the collection, analytics, mining and organization of massively parallel referrer data and massively serialized session clickstream data. Our technology parses out the keywords from referring URLs – and after a decade of SEO almost every URL is keyword rich - and then diagnoses intent by crunching the data around the three dimensions of every page-view on the site. 1) What page a visitor came from 2) what page a visitor is about to view and 3) what happens when it is viewed. 
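
To make the referrer-parsing idea concrete, here is a minimal sketch in Python that pulls a search phrase out of a referring URL's query string. It is only an illustration, not Yieldbot's implementation: the function name `referrer_keywords` and the list of parameter names are assumptions, and a real system would also need to mine keywords from URL paths.

```python
from urllib.parse import urlparse, parse_qs

# Query-string parameters that commonly carry a search phrase (assumed list).
SEARCH_PARAMS = ("q", "query", "p")


def referrer_keywords(referrer_url: str) -> list[str]:
    """Return lowercase keywords found in a referring URL's search query, if any."""
    params = parse_qs(urlparse(referrer_url).query)
    for name in SEARCH_PARAMS:
        if name in params:
            return params[name][0].lower().split()
    return []


print(referrer_keywords("http://www.google.com/search?q=Steve+Jobs+tribute"))
# ['steve', 'jobs', 'tribute']
```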

Those first two dimensions are great pieces of data but it is coupling them with the third dimension that truly makes Yieldbot special. 

We give our keyword data values derived from on-page visitor actions and provide the data to Publishers as an entirely new set of analytics that allow them to see their audience and pages in a new way – the keyword level. Additionally, our Yieldbot for Advertisers platform (launching this quarter) makes these intent analytics actionable by using these values for realtime ad match decisioning and optimization.

For example: Does the same intent bounce from one page and not another? Does the intent drive two pages deeper? Does the intent change when it hits a certain page or session depth? How does it change? These are things Yieldbot works to understand because if relevance were only about words, contextual and semantic technology would be enough. Words are not enough. Actions always speak louder.

All of this is automated and all of this is done on a publisher-by-publisher level because each publisher has unique content and a unique audience. The result is what we call an Intent Graph™ for the site with visitor intent segmented across multiple dimensions of data like bounce rate, pages per visit, return visit rate, geo or temporal.
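
As a rough illustration of segmenting intent by on-page behavior, the sketch below aggregates a tiny, made-up page-view log into per-keyword visits, bounce rate and pages per visit. The data, the function name `intent_metrics` and the definition of a bounce as a one-page session are all assumptions for the example, not a description of the production analytics.

```python
from collections import defaultdict

# Hypothetical page-view log: (intent keyword, session id), one row per page view.
page_views = [
    ("steve jobs", "s1"),
    ("steve jobs", "s1"),
    ("steve jobs", "s2"),
    ("ipad case", "s3"),
]


def intent_metrics(rows):
    """Aggregate page views into per-keyword visits, bounce rate and pages per visit."""
    pages_in_session = defaultdict(int)
    for keyword, session in rows:
        pages_in_session[(keyword, session)] += 1

    per_keyword = defaultdict(list)
    for (keyword, _session), pages in pages_in_session.items():
        per_keyword[keyword].append(pages)

    return {
        keyword: {
            "visits": len(counts),
            # A bounce is assumed here to be a session with exactly one page view.
            "bounce_rate": sum(1 for c in counts if c == 1) / len(counts),
            "pages_per_visit": sum(counts) / len(counts),
        }
        for keyword, counts in per_keyword.items()
    }


metrics = intent_metrics(page_views)
print(metrics["steve jobs"])
# {'visits': 2, 'bounce_rate': 0.5, 'pages_per_visit': 1.5}
```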

Here’s an example of analytics on two different intent segments from two different publishers:

[Screenshots: intent segment analytics from two publishers]

For every (and we mean every) visitor intent and URL we provide data and analytics on the words we see co-occurring with primary intent as well as the pages that intent is arriving at (and the analytics of what happens once it gets there). We also provide performance data on those words and pages.

Yieldbot’s analytics for intent are predictive. This means that the longer Yieldbot is on the site the smarter it becomes - both about the intent definitions and how those definitions will manifest into media consumption. And soon all the predictive analytics for the intent definitions will be updated in realtime. This is important because web sites are dynamic “living” entities - always publishing new content, getting new visitors and receiving traffic from new sources. Not to mention people’s interests and intent are always changing.

I hope this post has served as a good primer on Yieldbot for Publishers and maybe even gotten you interested in seeing it in action on your site. One of the best parts of what we do is seeing people’s faces when they first see the product. If you are a publisher and would like a demonstration please email info <at> yieldbot.com

 

Serendipity Is Not An Intent


Wired had two amazing pieces on online advertising yesterday and while Felix Salmon’s piece The Future of Online Advertising could be Yieldbot’s manifesto it is the piece Can ‘Serendipity’ Be a Business Model? that deals more directly with our favorite topic, intent.

The piece discusses Jack Dorsey’s views on online advertising and where Twitter is going with it. I had a hard time connecting the dots.

“…all of that following, all of that interest expressed, is intent. It’s a signal that you like certain things,” 

Following a user on Twitter is not any kind of intent other than the intent to get future messages from that account. If it’s a signal that you like certain things it’s a signal akin to the weak behavioral data gleaned from site visitations.

Webster’s dictionary describes intent as a “purpose” and a “state of mind with which an act is done.” Intent is about fulfilling a specific goal. Those goals fall into two classes, recovery and discovery.

Dorsey goes on:

When it (Google AdWords) first launched, Dorsey says, “people were somewhat resistant to having these ads in their search results. But I find, and Google has found, that it makes the search results better.”

At the dawn of AdWords I sat with many searchers studying their behavior on the Search Engine Results Pages. What I and others who studied searcher behavior at the time, like Gord Hotchkiss, learned was that people were not so much resistant to Search Ads as oblivious to them. People did not know they were ads!

Search ads make the results better because they are pull. Your inputs into the system are what pull the ads. So how does this reconcile with the core of Twitter's ad products that are promotions? Promos need scale to be effective. Promos are push. Precisely the opposite of Search, where the smallest slices of inventory (exact match) produce the highest prices and best ROI.

Twitter is the greatest discovery engine ever created on the web. But discovery may or may not be serendipitous. Sometimes, as Dorsey alludes to, you discover things you had no idea existed. More often, you discover things after you have intent around what you want to discover. This is an important differentiation for Twitter to consider because it’s a different algorithm.

Discovery intent is not an algo about “how do we introduce you to something that would otherwise be difficult for you to find, but something that you probably have a deep interest in?” There is no “introduce” and “probably” in the discovery intent algo. Most importantly, there is no “we.” It’s an algo about “how do you discover what you’re interested in.”

Discovering more about what you’re interested in has always been Twitter’s greatest strength. It leverages both user-defined inputs and the rich content streams where context and realtime matching can occur. Just like Search.

If Twitter wants to build a discovery system for advertising it should look like this.

Using Lucene and Cascalog for Fast Text Processing at Scale

Here at Yieldbot we do a lot of text processing of analytics data. In order to accomplish this in a reasonable amount of time, we use Cascalog, a data processing and querying library for Hadoop written in Clojure. Since Cascalog is Clojure, you can develop and test queries right inside of the Clojure REPL. This allows you to iteratively develop processing workflows with extreme speed. Because Cascalog queries are just Clojure code, you can access everything Clojure has to offer, without having to implement any domain specific APIs or interfaces for custom processing functions. When combined with Clojure's awesome Java Interop, you can do quite complex things very simply and succinctly.

Many great Java libraries already exist for text processing, e.g., Lucene, OpenNLP, LingPipe, Stanford NLP. Using Cascalog allows you to take advantage of these existing libraries with very little effort, leading to much shorter development cycles.

By way of example, I will show how easy it is to combine Lucene and Cascalog to do some (simple) text processing. You can find the entire code used in the examples over on Github.  

Our goal is to tokenize a string of text. This is almost always the first step in doing any sort of text processing, so it's a good place to start. For our purposes we'll define a token broadly as a basic unit of language that we'd like to analyze; typically a token is a word. There are many different methods for doing tokenization. Lucene contains many different tokenization routines which I won't cover in any detail here, but you can read the docs to learn more. We'll be using Lucene's Standard Analyzer, which is a good basic tokenizer. It will lowercase all inputs, remove a basic list of stop words, and is pretty smart about handling punctuation and the like.

First, let's mock up our Cascalog query. Our inputs are going to be 1-tuples of a string that we would like to break into tokens.

I won't waste a ton of time explaining Cascalog's syntax, since the wiki and docs are already very good at that. What we're doing here is reading in a text file that contains the strings we'd like to tokenize, one string per line. Each one of these strings will be passed into the tokenize-string function, which will emit 1 or more 1-tuples; one for each token generated.
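
The shape of that data flow (one line in, one 1-tuple out per token) can be sketched outside of Cascalog. The Python below is only an approximation for illustration: it is not Lucene's Standard Analyzer, the stop word list is a tiny stand-in, and the names `tokenize_string` and `tokenize_lines` are chosen just to mirror the description above.

```python
import re

# A tiny stand-in stop word list; Lucene's real list is longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in"}


def tokenize_string(text):
    """Yield one 1-tuple per token: lowercased, punctuation stripped, stop words removed."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS:
            yield (word,)


def tokenize_lines(lines):
    """Flat-map tokenize_string over input lines, one string per line."""
    for line in lines:
        yield from tokenize_string(line)


print(list(tokenize_lines(["The future of marketing belongs to software."])))
# [('future',), ('marketing',), ('belongs',), ('software',)]
```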

Next let's write our tokenize-string function. We'll use a handy feature of Cascalog here called a stateful operation. It looks like this:

The 0-arity version gets called once per task, at the beginning. We'll use this to instantiate our Lucene analyzer that will be doing our tokenization. The 1+n-arity passes the result of the 0-arity function as its first parameter, plus any other parameters we define. This is where the actual work will happen. The final 1-arity function is used for clean up.
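
The same three-phase lifecycle (set up once, process every record with the shared state, clean up once) can be illustrated in plain Python. This is a conceptual sketch only and not Cascalog's API: the class `StatefulTokenizer`, the `run_task` driver and the dictionary used as state are all hypothetical stand-ins for the analyzer described above.

```python
class StatefulTokenizer:
    """Mimics the three phases of a stateful operation: setup, per-record work, cleanup."""

    def setup(self):
        # Runs once per task (the 0-arity phase): build the expensive resource.
        return {"stop_words": {"a", "the", "of"}, "records_seen": 0}

    def process(self, state, text):
        # Runs for every input record (the 1+n-arity phase), reusing the state.
        state["records_seen"] += 1
        return [w for w in text.lower().split() if w not in state["stop_words"]]

    def cleanup(self, state):
        # Runs once at the end (the final 1-arity phase): release resources.
        state.clear()


def run_task(op, records):
    """Drive one task: setup, then process each record, then always clean up."""
    state = op.setup()
    try:
        return [token for record in records for token in op.process(state, record)]
    finally:
        op.cleanup(state)


print(run_task(StatefulTokenizer(), ["The art of software", "A lost art"]))
# ['art', 'software', 'lost', 'art']
```

The point of the pattern is that the costly setup happens once per task rather than once per record.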

Next, we'll create the rest of the utility functions we need to load the Lucene analyzer, get the tokens and emit them back out.

We make heavy use of Clojure's awesome Java Interop here to make use of Lucene's Java API to do the heavy lifting. While this example is very simple, you can take this framework and drop in any number of the different Lucene analyzers available to do much more advanced work with little change to the Cascalog code.

By leaning on Lucene, we get battle hardened, speedy processing without having to write a ton of glue code thanks to Clojure. Since Cascalog code is Clojure code, we don't have to spend a ton of time switching back and forth between different build and testing environments and a production deploy is just a `lein uberjar` away.

 

Recent Yieldbot Intent Streams Related to Steve Jobs

At Yieldbot our focus is on collection, organization and realtime activation of visit intent in publisher content. We do this not as a network but on a publisher-by-publisher basis because of this simple fact: every publisher has a unique audience and unique content. What that means is that even if the keyword is the same across publishers, the intent associated with it varies in each domain.

The original purpose of this post, however, was not to point out the flaws of network-based keyword buying vs the performance advantage of Yieldbot’s publisher direct model. Nor was the purpose to show you how much we truly understand publisher side intent at the keyword level and how we use that intelligence in an automated way to achieve the highest degrees of relevant matching.

The original purpose of the post was to meet the request of a few people that had asked me to share some more data visualization of our Intent Streams™ after we originally shared a few on our recent blog post about our data visualization methods.

It occurred to me the other day that the best representative example over the last month was intent around “Steve Jobs” so below we are sharing our 30-day Intent Streams™ from four publishers. 

If you’re new to our streamgraphs, the width of the stream is the measure of pageviews of intent associated with the root intent “Steve Jobs.” The other useful data points in these visualizations are the emergence, increases, decreases and elimination of the associated intent over time, as well as how many terms are seen to be associated with the root intent.

[Streamgraph images: Jobs_quotes, Jobs_next, Jobs_logo, Jobs_tribute]

Another way we visualize intent data is across a scatter plot. Here you see the performance of the “Steve Jobs tribute” compared to the other intent related to Steve Jobs looking at the number of entrances (aka landings) on the y-axis and the bounce rate of that intent on the x-axis. 

[Scatter plot image: Jobs_scatter_tribute]

It’s important to note in this scatter plot visualization that the analytics are predictive. We are estimating performance forward over the next 30 days. The four streamgraph visualizations were based entirely on historical data - in their case a 30-day look back as noted on their x-axis.

We hope you find this intent data as interesting as we do.