Joost.blog

Joost de Valk on the impact of AI on Search Engine Optimization | Trusted Web Podcast S3E3

Duration: 46:53


Joost de Valk is an entrepreneur and investor from the Netherlands who is best known for founding Yoast SEO (13 million+ users). Joost started out as a consultant and blogger in 2…

I'm joined by a friend, a visionary and a legend in the realm of content and search engine optimization: it's Joost de Valk. About a decade and a half ago, Joost started a little tool to optimize websites for search engines. Together with his lovely wife Marieke and an incredible team, they scaled it to over 13 million users and made it a commercially successful business, while giving back so much to the open source ecosystems that they're part of. Since we last spoke, Joost and Marieke sold the Yoast company to Newfold Digital, and today Marieke and Joost are angel investors (full disclosure: since 2020 they're investors in WordProof too) while running their beautiful family of four. Joost and I share a deep passion for building a better web, and as always I'm very much looking forward to this conversation. Joost, thank you so much for coming to the Trusted Web podcast.

Yeah, well, as always I'm very happy to talk to you, of course, and these topics are very close to my heart. So thank you for having me, and I'm looking forward to the conversation.

Over the last decades you've done countless keynotes, podcasts and media interviews, mainly on the topic of search engine optimization. One-on-one we've had many conversations on AI as well, so the idea was born to do a podcast on AI and how the two topics impact each other.

Well, I will say that it's also become a bit more prevalent in the last few months. I mean, the recent things that have happened in AI have a deep impact on search, of course.

And that would be my opening question then: what is the state of search today, what is the state of AI today, and, as a follow-up question, how will those two worlds impact each other?

So of course we've seen OpenAI release ChatGPT, and that caused a whole lot of buzz everywhere, but also inside Google. I've had friends at Google that I was talking to tell me: yeah, we were actually surprised by that, it was sooner than we expected, we weren't ready, et cetera. So Google got caught by
surprise, which is a new thing. And then Bing quite quickly actually added Bing Chat, so they added ChatGPT into Bing, and I think that that's actually the most awesome way of using ChatGPT right now, because it uses the newest models and it actually gives references to the open web in its responses. ChatGPT doesn't source its content: it has a lot of content, of course, that it bases its answers on, but it never tells you where it gets those answers from. Bing Chat actually does do that, which I think is awesome. It's also showing us a new type of interface. It's a new way of searching which is way more chat-based, which in a way is, for the first time, really the type of digital assistant that these search engines have been claiming to have for years now. Now, finally, they can answer a question and then you can follow up, and it's actually a real conversation: it uses the context of your previous questions to answer the follow-up questions. So it truly becomes a conversation, and I think with that it is a truly interesting new approach to search that will have a very deep impact on search.

Microsoft has wished to do conversation since they introduced Clippy in Word, don't you think?

Yeah, in many ways I think Bing Chat is Clippy on steroids. But it is so powerful that it also still has its risks, right? I think we'll touch on those a bit today, because not everything that it outputs is true, which is something that these search engines have always prided themselves on: that they were giving you the best answers on the web. And suddenly some of the answers that it gives are not true; they're in fact completely hallucinated, as we call it. I think that that is one of the things that we need to teach everyone about AI, and especially about these large language models that we're now really treating as one thing together with AI. They're
really just a part of the larger field of AI, but what everyone's seeing now is basically large language models at work, and it's good to understand what these things do. They really predict what would be the logical thing to say as an answer to the question that you're giving, but it doesn't necessarily have to be truthful. So they really are generating the almost perfect, socially acceptable response to your type of question. They're the best bar buddies you'll ever have, but that doesn't mean, just like in the pub, that the answer has to be true. And I think that's a risk.

Is trust not part of the design of the models?

No. It is as simple as this: when it's generating text, it's really generating text. It's not bringing up truths from somewhere and generating a text around that. So it will generate facts that are not facts, but it will tell them to you with a style and a vigor as though they're fully true. It wasn't intended, when it was initially built, to be a thing that generated truths, and that's where trust becomes an issue. I think we'll have to figure out, as a society but also these search engines themselves, how they can fix that and whether we are okay with having search engines answer like that. I honestly think that we should be regulating that a whole lot more than we are, and that we have a right to know what they trained those large language models on and what we can expect those large language models to give in return as answers as a result of that. Because a large language model outputs what it has been trained on, so if you put garbage into it, you get garbage out of it. A lot of what it's been trained on is the web at large, and honestly the web at large is quite garbage. So the output of these models is by no means inclusive, is by no means, well, true, and I think that that is something that we have to, well, start
compromising, and start figuring out: what can we do with these things and what can we not do?

Yeah, so that's about the state of AI, and the state of search a bit. ChatGPT is the fastest-adopted technology in history: 100 million users in less than, what was it, two months. The day before we started using ChatGPT as a society (or using is not the right word: started playing with it), what was the state of search?

So in many ways search has not changed all that much in the last decade. Search engines have been focusing on: which results can we surface, what is good content, who do we trust? Google has called this E-A-T, and they've recently added another E, so it's experience, expertise, authority and trust. These four are the things that they train their quality raters on, who they use to train their own machine learning models. So they train their quality raters to look for signs of expertise and authority, et cetera. That was already important before AI suddenly became so incredibly prevalent, and I think it'll be even more important after that, because suddenly we will get a deluge of content. Creating content is suddenly super simple. I'm not saying that people should generate their content with these models, because I don't think that that's necessarily a good idea yet because of the hallucination problems, but people will do that, and there will be a lot of bad content on the web suddenly. I think that's one of the things that Google is really worried about as well: not just how do we compete with Bing and make sure that we have the same interface and the same functionality, but also how do we keep ranking stuff on the web when it's all suddenly so bad and there's suddenly so much more of it? And also, how do we train our models on content that is truly unique and that is not generated by a model itself? Because then you get, yeah, you're basically
regurgitating garbage.

500 server error, infinite loop.

Yeah, it really becomes a problematic thing. So I think the state of search was very much stable for quite a while, and this will actually upset it quite a bit more than it has been in the last few, well, decades, I'd almost say, because it really is a step change in how quickly people are able to make content and thus create a lot of it. That means that they have to adjust their models. They might also have to ask: can they store everything, or is that too much at some point? Can they crawl everything? That is a topic that's close to my heart as well, because the way that they crawl the web is totally unsustainable in a way. And then there's the cost factor to it, because running a large language model is a costly business. It means continuously training these large language models, and that training is super intensive. It's intensive enough that in a recent Washington Post interview (I'll drop you a link for the show notes) the CEO of Google was talking about the cost impact of calculating these models. This is a company that did 192 billion dollars in revenue last year that is seriously thinking about the cost impact of running these models, and I think that has to give you pause: if they're thinking about that, why is that so huge? It turns out that by some estimates it's about 10 times more expensive to run a search on AI than it is on their traditional methods. So suddenly their costs 10x, and that's a problem.

And is it Moore's law? Will that easily change over time?

I think that'll come down a lot, quickly. I hope so, because otherwise this is entirely unsustainable. They're literally using entire data centers for months now to calculate some of the models that they're using. And basically the way that model works is that it gets trained, and it's still from the moment it started training,
so if they take three months to train it, it's almost already three months out of date by the time it's finished. That they'll have to become better at, because otherwise it means that they constantly have to start training new models. But I do think that that will come down quickly. You're already seeing some adaptations where you can run relatively simple AI versions on a laptop, which means that, if that happens, the cost will come down. But for now it is truly a ridiculous cost.

And with that cost, that means training models like that is exclusively available to people who can finance that.

Yeah, that can finance that. It's an expensive business. I will say that OpenAI has actually done quite a bit of good work in opening this up. OpenAI has a couple of methods, specifically a technique called embeddings, where it allows you to put a blob of content, say a thousand web pages from your own site, into OpenAI and say: okay, you can only answer with content from this embedding. With that you can do absolutely awesome things. You can suddenly start building your own AI on stuff that you trust, and then these models become super powerful, and I think they will actually bring a very big change to stuff that we do. Imagine you're a law firm and you have thousands of contracts that you've put out over the last decade. You throw all that into your own large language model and you start using that to generate your contracts. That is super powerful and will save you incredible amounts of time. A friend of mine said: when you had the steam engine, it was a hundred horses; OpenAI is a hundred interns.

Wow.

And you still have to check what is coming out of that, and you have to do a manual review, but it'll save so much time. I really think it'll make people that use it effectively super productive, and I think society as a whole needs to get
ready for that, because if developers suddenly become 30, 40, 50 percent more productive, we suddenly have more developers building different things, which means faster automation, which means a whole lot of other things suddenly become automated. And then other people start using these models too, so lawyers and other people become more effective and faster at what they do. I don't know that we're ready for that.

And just on a personal level: imagine that over the last decade you would have written 50,000 or 100,000 emails. Putting that in that model, with all the proposals I ever wrote, with all the deals we ever closed, that would be mind-blowing, if you have autocomplete on just your own sentences.

I think you'll get there fairly quickly, because that is actually relatively cheap to do. It's a lot of content for you, but it's not necessarily a lot of content for a model. One of the things I've been doing, with a tool that I really like called docspot.ai: I've actually been feeding all my own bookmarks into that. It does these OpenAI embeddings, so I give it all my bookmarks and I make it answer just from my bookmarks. At that point I have an AI that is tailored to me, that is basically me, in a way, and my knowledge, but that I can question. And that's super powerful.

And the OpenAI embeddings, would you trust them with all your emails? Your bookmarks is one thing, but...

No, I wouldn't do that yet, no. This is where there are two discussions that I really think politicians need to get in on very quickly. One is the privacy aspect of this: where are you storing that data that you're embedding into the model? And how are you using my data? Are you using my data just for me, or are you using my data to train your wider model as well? And am I okay with you doing that?
That's something that we need to figure out. On that basis ChatGPT has already been banned in Italy, because, well, nobody knows.

Is that good or bad, in one sense, that they try to do it like that?

I think it's good that they banned it. I don't think it's necessarily the right reason alone, but I think it's good that they at least are aware of: oh wait, we need to actually look at this. And then there's the other problem. These large language models are trained on a base of content, and we don't know what that base of content is. We really don't know what they put into the model, and we have a right to know, because it really determines what the outcome of that thing is. But it's also been trained on a lot of content that they don't necessarily own, so it sometimes comes back with things that can be pointed at as things that it shouldn't have access to, or that it at least shouldn't allow other people access to in that way. In the last few weeks we've seen results that can only come from private GitHub repositories. We've seen results that can only come from news articles that are behind a paywall and that search engines are only allowed access to because of first click free. So we've seen these things where they are definitely not allowed to use that content automatically for stuff like that. But in general it's also that they've clearly used all of The Guardian's archives, for instance, to train their model. What did they pay The Guardian? Zero. And is that smart? Even for them, is that smart? I mean, they need to train their models on content that is good and unique. If they take away the business models of the people that they need to create content to train their models on, then they have a problem themselves. But we as a society have an even bigger problem, and I fear that if politicians don't get involved in these
discussions very quickly, we will actually have killed some of these web publications and the whole news industry, and therefore maybe democracy as a whole. So I'm pretty optimistic about how this can be useful and how this can really help us. I'm slightly pessimistic about how quickly politicians will start getting this, and I hope that they prove me wrong.

We are both Europeans, we both live in the Netherlands. Europe is the inventor of GDPR. The execution was terrible, but we both loved the intention of GDPR: building better privacy rights and better data rights for citizens of Europe and then the world. Are policy makers the ones we should rely on for fixing the problems that are forming right now?

I think we've proven time and again as a tech industry that we're not necessarily the best industry for this stuff. And while I would want us to do better things as an industry, I don't think we do. So yeah, I do think that it needs regulation, and it needs it fast.

So I want to cycle back a bit to the AI thing, because you made a few things so practical. A lot of people say: hey, the future with AI, we can do this and that. But you are good at explaining how it works, how the model works, like predicting the most socially accepted next words, stuff like that. Bing Chat, for example, does it really give the sources where it gets the information from?

Yeah, it does. The thing is, I've never seen anything iterating at the speed at which this has been in the last few weeks. When Bing Chat first came out, I asked it to write a wiki page about me. Not necessarily because I think that that's a very interesting thing, but because I'm very good at discovering whether there are untruths in that, because it's about me and I know myself quite well. So I had it generate that, and it said that I was an investor in a company that's
actually Yoast SEO's biggest competitor, which I'm of course not an investor in. It had some other completely fake things in there. Since then they've added a slider in the Bing Chat interface that actually allows you to say more creative or more truthful, and when you set it to more truthful, it really doesn't generate all that many untruths anymore. OpenAI has been iterating on this as well. They don't want to output those falsities; all of them realize that that is problematic, to say the least. But it literally generates URLs sometimes. When you ask ChatGPT what the source of something is, it'll generate a URL on The Guardian. This has truly happened: it generated a URL that had never existed, but it looked like a Guardian URL and it could have existed. People emailed The Guardian about it (hey, we can't find this article), and an editor at The Guardian was complaining about this publicly: hey, we're getting these emails about articles that have never existed. Checking your sources now means checking that the articles you reference actually exist. This is the level to which these AIs come up with facts that aren't facts, and I think that's deeply problematic. And that is something that, well, we can't really fix, because it's how transformer models work, which is what all these things are based on. A slight historic note there: ChatGPT is based on transformer models, which were created at Google and then open sourced. So there are a lot of funny episodes and sidelines to this, but I think there's a problem there that won't go away easily, and it's not a thing that's easily solved in the current iteration of the tech.

But the thing with the sources that they can generate: is there an option where they have the right sources as well? Is Bing Chat reverse engineering?

No, no. Bing Chat really gives you links back to sources, but they are literally references,
as you would do in a scientific document, where it gives a one, two, three, four and then at the bottom a list of sources, and you can click on them. So it'll generate a lot less traffic to websites. In that regard it's already a huge change for SEO in terms of how much traffic we actually get from these, because a lot of the answers that people are seeking will be given to them right there in the search results. The fact that you're presented as an answer there might be good for you, but for a lot of sites that means that their advertising model, if that's how they make money, disappears. So, well, that's not good. Now, I'm personally not the biggest fan of advertising-based models at all, because I don't think that that's necessarily the best way to fund things that we really need to keep our democracy alive, like good journalism. But it is what the current model is. We have nothing else. We've tried a lot of other things, and some of us are good citizens of the world and we pay for our newspapers and we actually try to fund some of this journalism, but not enough people do to keep that going in a sustainable way. So we need the ad-based model to work for now. I think that's not going to be a simple thing to solve quickly; it's going to take time.

But the example that Bing Chat sets with, you could almost say, the provenance of where the information is coming from: that's a very different story than the mystical realm around AI. It's a black box, suddenly there's an answer, we can't do anything, sorry policy makers.

That's never true, right? When they say it's a black box, that's an escape that we should not allow them to have. Machine learning is nice. It really can help you create models and create things that can answer questions for us, but that doesn't mean that we don't have, as humankind, the responsibility to know what it's doing. So if you're using machine
learning (and there's actually another Dutch company that I really like, run in part by my brother-in-law, called Deeploy; they have this whole system of making AI explainable, or making machine learning explainable), this is literally what people need to do. If you apply for credit and your bank doesn't give you credit because their machine learning model told them to not give you credit, they need to be able to tell you why. The fact that they sometimes can't is literally inexcusable; the law already forbids it. They need to be able to do that, and there are ways in which you can machine learn that and still know why it did what it did. At that point you're basically using these huge data sets to learn, and you learn more efficiently, but you're still figuring out what you're learning instead of saying: yeah, the AI did this. That's not an excuse that we should allow people to give.

I fully agree, of course, but it still happens a lot.

It happens a lot, and I think lawmakers just need to be able to say, convincingly, to themselves as well: this is not okay, we need to know what the model did and why. And if the model has been built in such a way that it can't explain itself, well, then build it better. It's really as simple as that.

Thanks for giving so much color on the many aspects of AI. If we go to search engine optimization: we identified some risks, but let's summarize a bit. What are the risks for the current game of search engine optimization with the state of AI today?

Because it's so easy to generate a whole lot of content that looks sort of reliable (it doesn't have to be, but it looks reliable), it suddenly becomes way more important who authored that content and who literally attached their name to it. And also, because all of that content looks alike, it suddenly becomes very important who had it first, more than it did for quite
a while. This is actually where WordProof, I hope, can play a role at some point, where we can prove: hey, this content was here first. In a future iteration I hope that we're also going to be able to tie the author to it a bit more reliably, because putting content out there that is written in the style of Joost de Valk, the person, and then attaching my name and picture to it is going to be fairly simple, and that doesn't mean that I actually wrote it. So we need better ways of proving that I actually did write it. That is something that we have to figure out systems for, and we'd better get to that a bit quicker than we've been doing for a while. But I think that search engines to some extent already do that quite well. They know authors, they have been doing this sort of linking of authors to content, and I think that they'll focus on that more and more. So these E-E-A-T signals (expertise, experience, authority, trust) are the types of signals that are, literally, simply putting an author on the page, showing who authored the content, showing when that content was authored, and then linking to the social profiles of that author, so that the search engine can tie all of it together and can check whether that author actually links back to that website. There's this whole thing that they need to do to check the credibility of that link. They have been doing that for well over a decade, but they've not done it as well; it wasn't that much of a first-results thing that they needed to do. It wasn't that important to them, but it will be that important to them fairly quickly now, because that actually leads to them being able to say: we trust this author, we want to show these results in our search results. So there will be a bigger focus on E-E-A-T. And there's another aspect to that: Google rolled out an update called the helpful content update last year, and so people are
already saying that we should actually call it H-E-E-A-T, because of course the content needs to be helpful as well and needs to actually answer a question that the person searching has. I don't think that's new; that's basically been SEO forever.

That's what you have been saying for over a decade: simply be the best result.

Yeah, simply be the best result for the things that people are searching for. I mean, it is not harder than that, but it's also not easier than that, because that's actually hard. But I do think that thinking about it in that way helps: hey, it has to be helpful, and we have to show why we as a site, or I as an author, have the experience and the expertise and the authority to say something about this that other people should trust. I think that that combination of things will become super important. I also think that that gives a real fighting chance to newspapers and to good bloggers and to experts and to medical sites. What you're already seeing these sites do as a result of E-A-T is things like showing who wrote the article, which doctor checked it, who updated the article, who copy edited the article. You're seeing people show stuff like that on their web pages, and I love it.

It's the kind of transparency that really will improve how we build the web, and it's sort of the missing layer of the web. It was built on the assumption that we could trust everything that was put down there, because initially it was built to just transfer documents between universities and institutes like that, places and communities where trust was part of the fabric of that community. So the technology didn't need to solve that.

No, and now it really does need to solve that. I think we need to get to a technical layer that is relatively simple to understand, but where people can verify that the people that they
think wrote the article actually wrote the article, or where we can verify that the person who checked the content was actually a medical doctor, if we're talking about medical info. Google already focuses on this a lot more in a couple of specific niches, what they call your money or your life content, because in that type of area (so medical, but also loans, credit cards, et cetera) it is super vital that the results they give to people are truthful. As an SEO, one of the things that we have to tell people to do is: show why people should trust you. Build that profile. Understand that who the author of a particular piece of content is, is now an important aspect of what that content is. I hope that we'll get to technical solutions that allow us to prove that in a more, well, certain way than we can currently, but in SEO specifically it's already important right now to really start doing that.

So to summarize a bit: there's the risk of information overload, and there's the risk of reliability. With the information overload comes how to rank, and who owns what, and provenance: who is the sender of information? What we say at WordProof and Trusted Web, or what we fight for, is firstly an open source way to prove who publishes what, when, and secondly making a relationship between the amount of accountability a sender takes and how far information can reach, or rather between the identity connected to information, the sender, and how high it ranks. So you always have freedom of speech, you can publish everything on the internet, but there must be a certain level of accountability before it can rank high.

Yeah, and at the same time that brings problems too, right?

For sure.

Because to a certain degree anonymity is also a super important thing. If you're in China or in Iran, or even in the US, if you're saying the wrong things, you
don't always want to attach your identity to a piece of content. For you to be forced into actually attaching your identity to a piece of content for it to rank is also a bit of a problem, so we need to figure out how to do that. We need to really figure out how we still allow for that, and which methods we can come up with that allow for anonymity on stuff that needs anonymity.

In the blockchain space there are things being developed right now, one of them being zero-knowledge proofs. So there's my identity in my wallet, and in a liquor store, for example, the question is often: how old are you, and can you prove it with your ID? But it's not their business what my age is. It's only their business, binary, whether I'm old enough or not. Zero-knowledge proof is a technology for exactly this, so I can show only the relevant thing: am I a doctor with a degree that's recent and that's not been revoked, for example. So there are dots that need to be connected over the coming years. You're the third guest in this season. The first one was Scott Stornetta, the inventor of blockchain 30 years ago, so an open source technology to prove that you didn't tamper with information. The second one was Andy Parsons from Adobe, and they run the Content Authenticity Initiative. The point they're making in their manifest is: for the web to move forward, provenance is super important. It's for monetization, it's for trust; it's a large topic. In the answers you gave, provenance and identity are the way forward. Maybe a weapon against AI (weapon is too strong a word there), but provenance is an important route forward to handling the challenges we face with AI.

Yeah, I fully agree. I also don't want to be known as the guy that's against AI. As I said earlier, I think there are just huge chances in using AI properly and
actually huge things that we as humanity can do with that and move things forward with. But yeah, provenance in search results is maybe too restricted of an area to talk about this. Everywhere where AIs give results, we as a society need to know why it gives those results, and we need to be able to dig into that. And I think that, yeah, I can't understate how important that is. Or overstate, or whatever, I don't even know the words anymore. But it's really, really important that as a society we know what these models do and how they work, and that we make them explainable. Yeah, and reliable. Period. Period, yes.

Yeah, there are so many key insights for policy makers here; we'll make sure to distribute them there. Specifically on AI and misinformation: is there a role for AI in fighting misinformation or disinformation?

Well, maybe, if it can recognize it reliably. I think to some extent search engines will always use machine learning models to try and determine the authenticity of content and whether they can rank it or not. It's not like humans are ever going to be creating search results. So yes, there's also a place for those models to actually help fight that same misinformation that is caused by similar models.

Yeah, and the hard thing is we don't trust people to judge if this information is real or not. No. I think the hardest part about this is that if you generate a piece of text with an AI and you check 10 facts in there, the thing that you don't check, because it looks so incredibly reliable, might be wrong, and you publishing it might actually make it worse. So that's where it's problematic.

I could imagine. Lastly, taking action: as a publisher, or as an e-commerce business, or as a government publishing information, what should I allow, and what action
should I take, firstly? And secondly, more specifically, do I want my content to be in the AI models or not?

So, what action should you take? Well, there's not a whole lot you can do. You can start timestamping your content with WordProof, and you should (this is the investor speaking), and I think that might actually be helpful in the long run if you start doing it now. The sooner you start, the better. But there's not a whole lot more you can do right now. We don't really have standards yet for any of this. They are being developed as we speak, and this is going incredibly fast, like literally there's news every day. So that'll change soon, but right now you can't do much.

Do you want your content to be in those large language models? I think maybe yes, maybe no. It depends a bit on what you do, where you are, etc. But it also depends on the model. In some models you might be delighted to be in there. Say you're a publisher of a very specific piece of information, and someone is building a model for your industry and is doing that in a way where it gives links to the source every time. Then of course you want to be in that model, because it's good for you. But this becomes a licensing question really quickly.

Absolutely. And for the ChatGPT model it is possible to reverse engineer the links for the sources, that's what we learned in this country. Yeah, to some degree. It's hard, but you can sometimes see that it gives results that are, well, too similar to stuff that can only be found in specific repositories of information. So, like the Getty Images watermark that is showing up in, I think that was Midjourney. But yeah, it happens in all these models.

Joost, thanks so much for giving so much color to everything AI and search engine optimization, and provenance as a way forward. Are there things that we missed in the conversation that need to be
added?

No, I don't think so. I do think that it's important for everyone to go play with these AI models and to see for themselves what it really is.

Thanks so much, this was a wonderful conversation. Thank you for having me, it was really fun to talk about all this. Yeah, wonderful, and I can't wait to follow up in a year or so to see how the first things unravel. I'm always here for that, as you know. Perfect, I hope to talk soon. [Music]