- News Corp Australia revealed that they are using generative AI to publish 3,000 news articles a week
- Wayfair is using text-to-image AI models to show how you could redecorate your living room with furniture that might resemble items in their catalog
- 3M Health Information Systems, Babylon Health, and ScribeEMR have adopted the new AWS HealthScribe service to automatically generate clinical notes from doctor-patient conversations
- In the Prompting section, we discover that large language models might lose accuracy with larger context windows.
- In The Tools of the Trade section, we use LM Studio and the new Stable Beluga 2 model to create a personal AI assistant that runs on our computers.
Aaaaaaaaand we are back. Thanks for your patience while I tried to recharge a bit.
Half of this Splendid Edition is a bit more technical than usual this week. Apologies if some of you find it a bit much.
Even if you are not interested in the topic, I highly recommend you read it, because it will clarify certain mechanisms that we’ll encounter again in the near future when talking about GPT-4.
If, instead, you liked this week’s content, as usual, I’d appreciate you sharing this newsletter with friends and colleagues, and talking about it on social media.
Thank you.
Alessandro
What we talk about here is not what could be, but what is happening today.
Every organization adopting AI that is mentioned in this section is recorded in the AI Adoption Tracker.
In the Publishing industry, News Corp Australia revealed that they are using generative AI to publish 3,000 news articles a week.
Amanda Meade, reporting for The Guardian:
News Corp Australia is producing 3,000 articles a week using generative artificial intelligence, executive chair Michael Miller has revealed.
Miller told the World News Media Congress in Taipei that a team of four staff use the technology to generate thousands of local stories each week on weather, fuel prices and traffic conditions
…
Stories such as “Where to find the cheapest fuel in Penrith” are created using AI but overseen by journalists, according to a spokesperson from News Corp. There is no disclosure on the page that the reports are compiled using AI.
…
The spokesperson confirmed Miller had made the comments at a conference last month and said it would be more accurate to describe the “3,000 articles” as providing service information.
“For example, for some years now we have used automation to update local fuel prices several times daily as well as daily court lists, traffic and weather, death and funeral notices,” the spokesperson said.
“I’d stress that all such information and decisions are overseen by working journalists from the Data Local team.”
…
News Corp recently advertised for a data journalist whose tasks include creating “automated content to build a proposition and pipeline for revenue”.
As we wrote in Issue #5 – The Perpetual Garbage Generator, this evolution is inevitable.
And while, for now, AI is used for relatively simple article generation (like turning dry data into a narrative that is more digestible by humans), you have to ask yourself what will happen once GPT-5 comes out next year.
There is no logical reason to believe that more capable AIs won’t write more sophisticated articles, trouncing journalists in terms of delivery speed and, eventually, quality.
For example, what happens if I feed a future GPT-5 or GPT-6 with a database of shares sold and bought by US politicians, together with a database of stock prices and news articles published during the same period, and then I ask the AI to find correlations and describe them with a compelling narrative that resembles a spy story?
In the E-commerce industry, Wayfair is using text-to-image AI models to show how you could redecorate your living room with furniture that might resemble items in their catalog.
Kyle Wiggers, reporting for TechCrunch:
Wayfair launched a new app with a cutesy name, Decorify, that lets customers create “shoppable,” “photorealistic” images of rooms in their homes or apartments in new styles. After uploading a picture of the space they want to redesign, the app enables them to transform the space with styles like “Modern Farmhouse,” “Bohemian” or “Industrial.”
…
For now, Decorify only supports living rooms, disappointingly — Wayfair says that’ll change in the future.
Once you’ve uploaded a photo, Decorify prompts you to choose a style: “Traditional,” “Modern,” “Farmhouse,” “Bohemian,” “Rustic,” “Industrial,” “Glam” or “Perfectly Pink.” Then, the app “transforms” your room, replacing the fixtures, rugs and furniture in the photo with products in the selected style.
…
“Combining the visual discovery and inspiration with shoppable products from an extensive catalog that covers everything home, Decorify has the potential to drive significant value for Wayfair customers,” Sadalgi added. “Wayfair hopes Decorify can kick off a camera-first shopping paradigm for visual discovery where the journey simply starts with clicking an image.”
In my brief testing, Decorify worked well enough — style depending. I had the best luck with “Rustic” and “Modern,” which, true to the app’s sales pitch, replaced my pictured couch, TV, entertainment console, table, chairs, siding and even ceiling with items you might find in Wayfair’s catalog.
…
Decorify isn’t perfect, though. The products in the transformed photos aren’t actually real — they’re only meant to serve as inspiration. Decorify provides a list of the best matches available on Wayfair, but the suggestions weren’t often that close in my experience.
…
Currently, Decorify is using Stable Diffusion, the open source diffusion model from Stability AI, as the base model. But Wayfair plans to gradually fine-tune the model on its in-house image data.
“We’re confident that we can fine-tune the model using proprietary Wayfair brand data so the results look a lot closer to a lifestyle photo that you see on Wayfair today, with more creative control over the style and one-to-one correlation with real products,” Sadalgi said. “In the short term, Decorify will support more rooms and improve the shoppability … We plan to offer swapping out furniture with specific furniture you may have already found on Wayfair, so customers can visualize how a product may look in their space.”
So, to be clear: this is not a virtual placement of Wayfair catalog items in your physical space, via augmented reality, like we have seen for years from IKEA and others.
This is a so-called image-to-image process that takes the picture of your living room as input and uses it to constrain the generation of a new image together with a pre-defined prompt created by Wayfair.
Behind the scenes, the process is likely the same as the one I described in the intro of this week’s Free Edition, leveraging auxiliary AI models from the ControlNet family to constrain Stable Diffusion.
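To make this less abstract, here is a minimal sketch of what such a ControlNet-constrained generation could look like with the open source diffusers library. The model names, the line-detection preprocessor, and the style prompt are my illustrative assumptions, not Wayfair’s actual pipeline.

```python
# Minimal sketch: constrain Stable Diffusion with a ControlNet that preserves
# the structure of the room (the straight lines of walls and large furniture).
# Model choices and prompt are illustrative assumptions, not Wayfair's pipeline.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from controlnet_aux import MLSDdetector

room_photo = Image.open("living_room.jpg")

# Extract a structural map (straight lines) from the user's photo.
mlsd = MLSDdetector.from_pretrained("lllyasviel/ControlNet")
control_image = mlsd(room_photo)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-mlsd", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The pre-defined prompt encodes the style the user picked in the app.
result = pipe(
    "a living room furnished in industrial style, photorealistic interior photo",
    image=control_image,
    num_inference_steps=30,
).images[0]
result.save("redecorated_living_room.png")
```

The control image keeps the generated room roughly aligned with the geometry of the original photo, while the prompt drives the style.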
Actually placing their products would require a far more complicated pipeline: Wayfair would first have to segment your living room with a semantic segmentation model, then recognize which segments can be painted over with its products, then layer digitized versions of those products on top of each usable segment, and finally perform an image-to-image generation to blend everything together in a realistic way.
And all of this only after training a model to associate a subset of their catalog with specific semantic categories. In other words, if the predefined prompt says “Industrial Design”, what products should show up as potential candidates for the inpainting phase?
It’s not an impossible job but, at the moment, it’s very expensive and it still requires human supervision to pick the best candidates from the catalog. To automate this last step, too, Wayfair would have to fine-tune a model with a technique like Reinforcement Learning from Human Feedback (RLHF) to teach the AI how to pick like a human designer.
I’d be shocked if this approach didn’t become the standard in interior design in the future.
In the Health Care industry, 3M Health Information Systems, Babylon Health, and ScribeEMR have adopted the new AWS HealthScribe service to automatically generate clinical notes from doctor-patient conversations.
Kyle Wiggers, reporting for TechCrunch:
At its annual AWS Summit conference in New York, Amazon unveiled AWS HealthScribe, an API to create transcripts, extract details and create summaries from doctor-patient discussions that can be entered into an electronic health record (EHR) system. The transcripts from HealthScribe can be converted into patient notes by the platform’s machine learning models, Amazon says, which can then be analyzed for broad insights.
…
HealthScribe identifies speaker roles and segments transcripts into categories based on clinical relevance, like “small talk,” “subjective comments” or “objective comments.” In addition, HealthScribe delivers natural language processing capabilities that can be used to extract structured medical terms from conversations, such as medications and medical conditions.
The notes in HealthScribe, augmented by AI, include details like the history of the present illness, takeaways and reasons for a visit.
Amazon says that the AI capabilities in HealthScribe are powered by Bedrock, its platform that provides a way to build generative AI-powered apps via pretrained models from startups as well as Amazon itself.
…
So is HealthScribe consistent? Can it be trusted, particularly when it comes to deciding whether to label a part of a discussion as “subjective” or “objective” or identifying medications? And can it handle the wide array of different accents and vernaculars that patients and providers might use?
The jury’s out on all that.
But perhaps in an effort to prevent some of the more major potential mistakes, HealthScribe can only create clinical notes for two medical specialties at present: general medicine and orthopedics. And the platform offers clinicians a chance to review notes before finalizing records in their EHR, providing references to the original transcript for sentences used in the AI-generated notes.
…
HealthScribe is “HIPAA eligible,” however — meaning customers who work with Amazon to meet HIPAA requirements can ultimately reach compliance. Those customers must sign a contract known as a business associate addendum, which AWS’ documentation covers in detail here.
Amazon says that 3M Health Information Systems, Babylon Health and ScribeEMR are among the companies already using HealthScribe.
Alongside HealthScribe, Amazon today announced AWS HealthImaging, a service designed to make it easier to store, transform and analyze medical imaging data “at a petabyte scale.”
The always-listening AI will transcribe every conversation in the doctor’s office, turning clinical documentation into a frictionless process. Should we ever overcome how awkward it is to be recorded and transcribed in every moment of our visit, that is.
Unrelated to this, I can see a future where somebody will try to do the same thing in sales conversations with prospects.
Before you start reading this section, it's mandatory that you roll your eyes at the word "engineering" in "prompt engineering".
Before we talk about any additional prompting technique, this week we need to take a step back and take into account an important new discovery that might impact how we interact with our AI models.
Since the very beginning of Synthetic Work, we’ve stressed how a larger context window is game-changing for sophisticated tasks like document analysis.
You certainly remember that, before the summer break, we used the new GPT-4 Code Interpreter model (which sports an 8K token context window, soon to be upgraded to 32K) and Claude 2 (which sports a 100K token window, eventually to be upgraded to 200K).
If you missed them, check Issue #22 – What AI should I wear today? and Issue #21 – Investigating absurd hypotheses with GPT-4.
As exciting as this is, new evidence suggests that these models lose much of their accuracy with longer context windows.
Look at this example:
Notice that there’s a typo in the tweet: it’s GPT-4 that loses 87% of accuracy and not Claude 2.
Other researchers have corroborated this finding, but only with anecdotal evidence. We’ll need a more thorough investigation to confirm this behavior, but it’s worth keeping in mind when we write very long prompts or analyze long documents.
The time has come for us to start testing open access and open source large language models. These models have reached a level of maturity that starts to match the performance of GPT-3.5-Turbo in some tasks and, at this pace, we might see them getting close to GPT-4 level of performance by the end of the year.
There are many reasons why you’d want to pay attention to these models.
Perhaps your company doesn’t want to depend on a single AI provider like OpenAI, Anthropic, or Microsoft. The more capable their models become, the more they will be able to charge for them.
Perhaps your company doesn’t want to depend on an AI model that can wildly fluctuate in terms of accuracy over time. You may have read about the heated debate between OpenAI and its customer base about the dramatic drop in accuracy of both GPT-3.5-Turbo and GPT-4 from March to now.
Perhaps your company wants to build a product on top of AI providers’ APIs but the cost is exorbitant for the use case you have in mind, making the whole business financially unsustainable.
Perhaps your company wants to fine-tune an AI model but you don’t feel comfortable sharing your precious proprietary data with an AI provider or you don’t want to wait for them to make the fine-tuning process easy and accessible (right now it’s more akin to a consulting engagement for most companies).
Perhaps your company wants to achieve a level of customization that is simply impossible to achieve with the security scaffolding that the AI providers have put in place to protect their models from prompt injections, reputational risks, and other potential liabilities.
We can go on for a while.
The point is that, at least in this early stage of technology evolution for generative AI, you may want to keep all your options open.
Just a few months ago, in the text-to-image (more properly called diffusion) models space, it seemed that nobody could beat Midjourney. To enjoy its frictionless generation of stunning AI images, you had to accept a lack of flexibility in image composition and final look & feel, because the alternatives, like Stable Diffusion and DALL-E 2, produced less-than-impressive quality.
Then, Stability AI released Stable Diffusion XL (SDXL), and now everything is worth reconsidering.
The same might happen with language models and the current dominance of GPT-4.
To understand how much these models have matured, we’ll test a version of the new LLaMA 2 70B model released by Meta and fine-tuned by Stability AI.
LLaMA 2 is not a single model, but a family of models that range from 7B to 70B parameters.
For the sake of simplicity, imagine that the parameters of an AI model correspond to the neurons of a human brain. The more parameters, the more capable the model is (and the more computing resources it needs to run).
It’s way more complicated than this, but you don’t want to read the full explanation, trust me.
Now.
In the AI world, everything has an unnecessarily complicated and misleading name. More importantly, everything has a name, including fine-tuned models. So, the Stability AI fine-tuned version of LLaMA 2 70B is called Stable Beluga 2. The original name was FreeWilly 2, but the company had to change it in less than one week, probably because of a trademark issue.
Testing an open access or open source model is not as easy as loading the website of OpenAI or Anthropic or talking to Bing inside Microsoft Edge or Vivaldi. But we can make it as simple as possible.
If the guide below is not working for some reason, reach out to me over the Discord server and I or another member of the Synthetic Work community will be happy to help you.
AI Model Download
The first thing we want is an application that loads the AI model and provides the scaffolding necessary for the model to be tricked into answering our questions in a chat.
Remember that large language models are glorified autocomplete systems. They were never designed to answer questions. The fact that they do it so well depends exclusively on how we fool them via the scaffolding that surrounds them in an AI system.
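To give you a concrete idea of what that scaffolding looks like, here is a minimal sketch in Python. The ### System / ### User / ### Assistant template is the one documented by Stability AI for Stable Beluga 2; other models expect different templates, and the system prompt and question below are just placeholders.

```python
# Minimal sketch of the chat "scaffolding": the application wraps your question
# in the template the model was fine-tuned on, then lets the autocomplete run.
def build_prompt(system_prompt: str, user_message: str) -> str:
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{user_message}\n\n"
        f"### Assistant:\n"
    )

print(build_prompt(
    "You are a helpful assistant that answers concisely.",
    "Write a short invitation for an industry analyst briefing.",
))
# Whatever the model autocompletes after "### Assistant:" is what the chat
# window shows as the "answer".
```

An application like LM Studio takes care of this wrapping for you behind the scenes; you only ever see the question and the answer.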
We are very lucky that LM Studio exists. It’s a free application for macOS and Windows (and soon Linux) that we encountered in Issue #18 – How to roll with the punches.
As I wrote back then, the alternatives are so complicated and convoluted that I could never explain how to use them in a newsletter like Synthetic Work.
Install LM Studio and open it. You’ll be welcomed by a search bar on the home screen:
Search for Stable Beluga 2 and download the exact version I highlighted below:
GGMLv3 is a file format, and the q in the file name stands for quantization.
We can safely ignore all of this technical mumbo-jumbo.
The only thing that is important to notice is that this particular AI model is demanding in terms of computer resources and not every reader will have a computer powerful enough to run it.
If that’s your case, you could download an alternative AI model like Stable Beluga 7B or Stable Beluga 13B, but you have to keep in mind that these smaller models don’t match the performance of GPT-3.5-Turbo.
Click Download and watch LM Studio beautifully take care of the model download for you.
These models are huge in terms of file size, so be sure to have plenty of storage space and a fast internet connection.
Done?
Congratulations. You have just discovered that the ethereal entity we call Artificial Intelligence is nothing more than ONE BIG FAT FILE.
The brain that moves the world can be easily copied from one computer to another via a simple USB drive or file transfer.
Did I break the magic for you?
LM Studio Setup
We don’t really have to do anything else to use Stable Beluga 2 as our personal and local ChatGPT. But if you want to get the most out of this system, there are a few things you could tweak.
This section is completely optional and you can skip it.
The most important configuration option is the System Prompt. What is it?
It’s the main magic trick that OpenAI, Anthropic, Microsoft, Google, etc. use to control how their AI models behave.
In other words, this is one of the secret ingredients that make Bing Chat talk like a psychopath. Or not.
The feature goes by many names; LM Studio calls it System Prompt. A couple of weeks ago, OpenAI announced the capability to define a system prompt for its ChatGPT models and called it Custom Instructions.
The System Prompt allows you to preset and constrain the behavior of the AI model regardless of what the user asks during the chat.
Look at how dramatically different the same model’s answer to the same question is with two different system prompts.
System Prompt default in LM Studio:
System Prompt recommended by Meta for LLaMA 2:
The beauty of this feature is that you can change the system prompt, and until a couple of weeks ago, it was an advantage offered exclusively by open access and open source models.
Today’s goal is not to explore the influence of the System Prompt, so we’ll keep it as is. In a future Splendid Edition, as soon as OpenAI enables Custom Instructions for UK citizens like me, we’ll see if we can create the ultimate system prompt by combining many of the techniques we have listed in the How to Prompt section of Synthetic Work. And then we’ll use it both in ChatGPT and in LM Studio.
Other things that you might want to tweak if it’s not too intimidating:
- Context Length: This setting defines the size of the famous context window. The default is set to 1,500 tokens (approx. 1,125 words), but you can extend it to the amount used for the original training of LLaMA 2: 4,000 tokens (approx. 3,000 words).
- Keep entire model in RAM: This setting speeds up the generation of the model’s answers, but it subtracts memory from every other activity of the system. Flag this option only if you have a huge amount of memory available. For example, my MacBook Pro, which I customized in Jan 2023 specifically to run future AI models locally, has 96GB of RAM.
- Apple Metal (GPU): This setting allows the execution of the AI model on the Apple graphics processing unit (GPU), which is where all AI models perform best. Without it, the model will be executed by the CPU and it will be slower than paint drying on a wall. If you have a Windows machine with an NVIDIA GPU, you won’t see this option.
- CPU Threads: This is the number of physical cores that LM Studio will use to run the AI model. It will further increase the speed at which the model generates its answers. Find out how many physical cores your computer has (often described in the configuration you ordered when you purchased the machine) and change the setting accordingly. These settings map quite directly onto parameters of the underlying inference engine; see the sketch after this list.
- Chat Appearance: This setting will format the answers of the model to look more pleasant, especially if you are asking it to generate some code. It’s purely aesthetic.
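For the curious, here is a minimal sketch of how these settings map onto the parameters of llama.cpp, the open source inference engine that, to the best of my knowledge, powers LM Studio’s GGML support, via its llama-cpp-python bindings. The file name, layer count, and thread count are illustrative assumptions; adjust them to your machine.

```python
# Minimal sketch: loading the same GGML file programmatically with llama-cpp-python.
# The file name and the numbers are illustrative assumptions.
from llama_cpp import Llama

llm = Llama(
    model_path="stablebeluga2-70b.ggmlv3.q4_0.bin",  # the one big fat file you downloaded
    n_ctx=4096,        # Context Length
    n_gpu_layers=40,   # how much of the model to offload to the GPU (Metal or CUDA)
    n_threads=8,       # CPU Threads: set this to your number of physical cores
    use_mlock=True,    # roughly equivalent to "Keep entire model in RAM"
)

prompt = (
    "### System:\nYou are a helpful assistant.\n\n"
    "### User:\nWhy is the sky blue?\n\n"
    "### Assistant:\n"
)
output = llm(prompt, max_tokens=256, stop=["### User:"])
print(output["choices"][0]["text"])
```

Nothing here is required to follow the rest of this tutorial; LM Studio exposes the same knobs in its right sidebar.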
If you have a system similar to mine and you have done everything right, the right sidebar of your LM Studio should look like this:
Chatting with Stable Beluga 2 (and others)
It’s finally time to see how Stable Beluga 2 performs compared to OpenAI and Anthropic models.
Let’s go back to one of our first tutorials with GPT-4, from Issue #8 – The Harbinger of Change.
Do you remember how we used the Assign a Role technique to convince GPT-4 to write a very convincing invitation for an industry analyst?
This was the original outcome:
Let’s see what Stable Beluga 2 can do with the same prompt:
Meh.
But this is not a fair comparison. We said that open access and open source LLMs are getting close to the performance of GPT-3.5-Turbo, not GPT-4.
So what does GPT-3.5-Turbo do with the same prompt?
Clearly, GPT-3.5-Turbo has no idea what “don’t be bombastic” means. Stable Beluga 2 seems the better option.
Given that we are at this, what about Claude 2?
Really bad. Here you have the first example of an open access AI model (LLaMA 2 and derivatives like Stable Beluga are not open source) that trounces a commercial one.
Let’s try something else.
Again, in Issue #8 – The Harbinger of Change, we used the Ask for a Reason technique to generate a very articulate market report template.
This was what GPT-4 generated back then:
This, instead, is Stable Beluga 2:
It’s not bad. Not as articulate as the one produced by GPT-4, but not terrible.
What about GPT-3.5-Turbo?
I’d say it’s a tie.
Claude 2?
This feels even more superficial than the ones generated by Stable Beluga 2 and GPT-3.5-Turbo. So, no.
Last test.
Let’s move to Issue #9 – The Tools of the Trade, where we encountered the Awake the Curious Child technique for the first time.
This was the original answer we got from GPT-4:
Can Stable Beluga 2 do any better?
Nope. The AI model really struggles to ask why recursively five times, and it’s inconsistent in how many times it performs the recursive loop.
Time to check GPT-3.5-Turbo:
The result is much better. But the model is overzealous: in the first example, it asks why six times instead of five. It also struggles to arrange the answer in a numbered list.
Let’s close with Claude 2:
Thankfully, Claude 2 redeems itself in this last scenario, correctly asking why five times and providing coherent answers.
All is well that ends well.
The key takeaway here is that the quality of Stable Beluga 2 is not the end, but the beginning. Stability AI worked together with Meta to release a fine-tuned version of LLaMA 2 immediately after the original model was released.
The rest of the AI community is working around the clock to further optimize the LLaMA 2 fine-tuning process. In fact, as we speak, we already have another fine-tuned model that surpasses Stable Beluga 2 in terms of quality:
And you, now, have a tool to explore and assess the progress in this space. Which also happens to be a tool to appreciate a little bit more the incredible things that GPT-4 can do compared to all its competitors.