accurrent a day ago

I gave it a prompt and it straight up hallucinated. My prompt was about writing an article on the advantages and disadvantages of Rust in the robotics ecosystem. It claimed that Google Cartographer was written in Rust. The annoying thing is that it was quite convincing. The citation it used turned out to be GeeksforGeeks blogspam that did not mention Cartographer anywhere, so I went and checked: it's a C++-only project. It's worrisome when you see people relying on LLMs for knowledge.

  • jeroenhd a day ago

    People trusting LLMs to tell the truth is the advanced version of people taking the first link on Google as indubitable facts.

    This whole trend is going to get much worse before it gets better.

    • tikkun a day ago

      I'm optimistic that hallucination rates will go down quite a bit again with the next gen of models (GPT-5 / Claude 4 / Gemini 2 / Llama 4).

      I've noticed that the hallucination rate of newer, more SOTA models is much lower.

      3.5 Sonnet hallucinates less than GPT-4, which hallucinates less than GPT-3.5, which hallucinates less than Llama 70B, which hallucinates less than GPT-3.

      • nytesky a day ago

        Eventually, won’t most training data be AI-generated? Will we see feedback issues?

  • leettools 20 hours ago

    We are actually working on a tool that provides similar functions (although we focus more on the knowledge base curation part). Here is an article we generated from the prompt "the advantages and disadvantages of rust in the robotics ecosystem" (https://svc.leettools.com/#/share/leettools/research?id=9886...): the basic flow is to query Google using the prompt, generate the article outline from the search result summaries, and then generate each section separately. Interested to hear your opinions on the differences, thanks!
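
    In case it helps, here is a minimal sketch of that flow in Python (illustrative only, not our actual code; it assumes the openai package, an OPENAI_API_KEY in the environment, and that the Google result summaries have already been fetched):

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        def ask(prompt: str) -> str:
            resp = client.chat.completions.create(
                model="gpt-4o-mini",  # placeholder; any chat model works
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

        def generate_article(topic: str, summaries: list[str]) -> str:
            context = "\n".join(summaries)  # summaries of the Google search results
            # Step 1: draft an outline from the search result summaries.
            outline = ask(f"Draft a flat list of section headings for an article on "
                          f"'{topic}', based on these search summaries:\n{context}")
            headings = [h.strip() for h in outline.splitlines() if h.strip()]
            # Step 2: generate each section separately, grounded in the same summaries.
            sections = [ask(f"Write the section '{h}' of an article on '{topic}', "
                            f"using only this material:\n{context}")
                        for h in headings]
            return "\n\n".join(sections)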

    • accurrent 15 hours ago

      I'm impressed; it's better than the article I found written by STORM. That being said, both tend to rely on what's available on the internet, so they lack the more subtle points. It's impressive that your article picked up on Pixi. Of course, as a practicing roboticist my arguments would be different, but at this point I'm nitpicking.

      • leettools 13 hours ago

        Thanks for the feedback! Yeah, by default this kind of survey article is generated from publicly available information in the search results, so the quality depends mostly on Google's ranking and your search terms. Right now we can add expert-picked documents to the KB and generate the results from the curated KB instead of directly from the search. Better prompting (specific to the target field of study) and more iterations (have a quality check and rewrite accordingly) should also be very helpful.

kingkongjaffa 2 days ago

Very cool! I asked it to create an article on the topic of my thesis and it was very good, but it lacked nuance and second-order thinking, i.e. here's the thing, now what are its consequences and potential mitigations? It was able to pull existing thinking on a topic but not really synthesise a novel insight.

Synthesis of Topic Outlines through Retrieval and Multi-perspective Question Asking.

From the paper it seems like this is only marginally better than the benchmark approach they used to compare against:

>Outline-driven RAG (oRAG), which is identical to RAG in outline creation, but further searches additional information with section titles to generate the article section by section

It seems like the key ingredients are:

- generating questions

- addressing the topic from multiple perspectives

- querying similar Wikipedia articles (a high-quality RAG source for facts)

- breaking the problem down by first writing an outline.

Which we can all do at home, swapping out the Wikipedia articles for our own data sets; see the retrieval sketch below.
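
For the Wikipedia ingredient, a minimal sketch using the public MediaWiki search API (the endpoint and parameters are real; the wrapper function is my own illustration):

    import requests

    def related_wikipedia_articles(topic: str, limit: int = 5) -> list[dict]:
        """Return titles and snippets of Wikipedia articles related to a topic."""
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "list": "search",
                "srsearch": topic,
                "srlimit": limit,
                "format": "json",
            },
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["query"]["search"]  # each hit has "title" and "snippet"

    for hit in related_wikipedia_articles("Rust in robotics"):
        print(hit["title"])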

  • kingkongjaffa 2 days ago

    I was able to mimic this in GPT without the RAG component using this custom instruction prompt; it does indeed write decent content, better than other writing prompts I have seen.

    PROMPT: create 3 diverse personas who would know about the user prompt; generate 5 questions that each persona would ask or clarify; use the questions to create a document outline; write the document with $your_role as the intended audience.
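
    Wired into the chat API it looks roughly like this (a sketch, assuming the openai package; the model name is a placeholder, and substitute $your_role yourself):

        from openai import OpenAI

        PERSONA_PROMPT = (
            "Create 3 diverse personas who would know about the user prompt. "
            "Generate 5 questions that each persona would ask or clarify. "
            "Use the questions to create a document outline, then write the "
            "document with $your_role as the intended audience."  # fill in $your_role
        )

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder; use whichever chat model you prefer
            messages=[
                {"role": "system", "content": PERSONA_PROMPT},
                {"role": "user", "content": "advantages and disadvantages of Rust in robotics"},
            ],
        )
        print(resp.choices[0].message.content)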

    • westurner 2 days ago

      PROMPT: Then, after conducting background research, generate testable and untestable hypotheses, and also suggestions for further study, given market challenges and relevant, marginally advantageous new and proven technologies.

dredmorbius 2 days ago

"Sign in with Google" is a show-stopper.

  • zackmorris a day ago

    Ya and unfortunately this is from Stanford. It's a private university, but that's still not a good look. It's amazing in 2024 that so many demos, especially in AI, are getting this wrong.

    We're long overdue for better sources of online revenue. I understand that AI costs money to train (I don't believe that it costs substantial money to run - that's a scam) but if we thought that walled gardens were bad, we ain't seen nothin yet. We're entering an exclusive era where the haves enjoy vastly more money than the have nots, so basically the bottom half of the population will be ignored as customers. The good apps will be exclusive clubs that the plebeians gaze at from afar, like a reverse zoo.

    I just want something where I can pay 1 cent to $1 to skip login. Ideally from a virtual account that's free to use but guilts me into feeding it money. So maybe after 100 logins I pay it a few dollars. And then a reward system where wealthy users can pay it forward so others can browse for free.

    I would make it in my spare time, but of course there is no such thing in the 21st century climate of boom-bust cycles and mass layoffs.

  • anotheraccount9 a day ago

    Yes, and it's not possible to delete the account (or your association with it).

  • jgalt212 a day ago

    And it's a challenge not to click that modal in error.

mburns 2 days ago

Reminds me of Cuil.

> Cuil worked on an automated encyclopedia called Cpedia, built by algorithmically summarizing and clustering ideas on the web to create encyclopedia-like reports. Instead of displaying search results, Cuil would show Cpedia articles matching the searched terms.

https://en.wikipedia.org/wiki/Cuil

chankstein38 2 days ago

Does anyone have more info on this? They thank Azure at the top so I'm assuming it's a flavor of GPT? How do they prevent hallucinations? I am always cautious about asking an LLM for facts because half of the time it feels like it just adds whatever it wants. So I'm curious if they addressed that here or if this is just poorly thought-out...

  • Sn0wCoder a day ago

    Not sure how it prevents hallucinations, but I tried inputting too much info and got a pop-up saying it was using ChatGPT 3.5. The article it generated was OK but seemed to repeat the same thing over and over with slightly different wording.

  • infecto 2 days ago

    If you ask an LLM what color the sky is, it might say purple, but if you give it a paragraph describing the atmosphere and then ask the same question, it will almost always answer correctly. I don't think hallucinations are as big of a problem as people make them out to be.

    • misnome 2 days ago

      So, it only works if you already know enough about the problem to not need to ask the LLM, check.

      • infecto 2 days ago

        Are you just writing negative posts without even seeing the product? The system queries the internet, aggregates that information, and writes a response based on your query.

        • misnome a day ago

          ChatGPT, please explain threaded discussions and context of statements as if you were talking to a five year old.

          • infecto a day ago

            Ahh so you are a child who has no intellectual capability past writing negative attack statements. Got it.

      • keiferski 2 days ago

        No, if the data you’re querying contains the information you need, then it is mostly fine to ask for that data in a format amenable to your needs.

        • o11c 2 days ago

          The problem with LLMs is not a data problem. LLMs are stupid even on data they just generated.

          One recent catastrophic failure I found: ask an LLM to generate 10 pieces of data. Then, in a second input, ask it to select (say) only numbers 1, 3, and 5 from the list. The LLM will probably return results numbered 1, 3, and 5, but chances are at least one of them will actually copy the data from a different number.
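
          That failure is easy to reproduce with a two-turn harness like this (a sketch, assuming the openai package; the model name is a placeholder):

              from openai import OpenAI

              client = OpenAI()
              history = [{"role": "user",
                          "content": "Generate a numbered list of 10 random city names."}]
              first = client.chat.completions.create(model="gpt-4o-mini",
                                                     messages=history)
              listing = first.choices[0].message.content
              print(listing)

              # Second turn: ask it to pick items 1, 3, and 5 from its own list.
              history += [
                  {"role": "assistant", "content": listing},
                  {"role": "user",
                   "content": "Return only items 1, 3, and 5 from that list."},
              ]
              second = client.chat.completions.create(model="gpt-4o-mini",
                                                      messages=history)
              # Compare by eye: the numbering usually looks right, but the
              # payloads are sometimes copied from the wrong items.
              print(second.choices[0].message.content)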

          • wsve 2 days ago

            I'm absolutely not bullish on LLMs, but I think this is kinda judging a fish on its ability to climb a tree.

            LLMs are looking at typical constructions of text, not an understanding of what it means. If you ask one what color the sky is, it'll find what text usually follows a sentence like that and try to construct a response from it.

            If you ask it the answer to a math question, the only way it could reliably figure it out is if it has an exact copy of that math question in its database. Asking it to choose things from a list is kind of like that, but one could imagine the designers trying to supplement that manually with a technique other than a pure LLM.

          • smcin 2 days ago

            Any ideas why that misnumbering happens? It sounds like a very basic thing to get wrong. And as a fallback, it could be brute-force kludged with an extra pass that appends the output list to the prompt.

            • o11c 2 days ago

              It's an LLM, we cannot expect any real idea.

              Unless of course we rephrase it as "when I roll 2d6, why do I sometimes get snake eyes?"

    • infecto 2 days ago

      Why does this get downvoted so heavily? It’s my experience running LLMs in production: at scale, hallucinations are not a huge problem when you have reference material.

    • chx 2 days ago

      There are no hallucinations. It's just the normal bullshit people hang a more palatable name on. There is nothing else.

      https://hachyderm.io/@inthehands/112006855076082650

      > You might be surprised to learn that I actually think LLMs have the potential to be not only fun but genuinely useful. “Show me some bullshit that would be typical in this context” can be a genuinely helpful question to have answered, in code and in natural language — for brainstorming, for seeing common conventions in an unfamiliar context, for having something crappy to react to.

      > Alas, that does not remotely resemble how people are pitching this technology.

gpderetta 7 hours ago

It gets commented often, but:

   `Panther Moderns,' he said to the Hosaka, removing the trodes. `Five minute precis.' `Ready,' the computer said.

siscia a day ago

We have been discussing a similar idea with friends.

The topic of knowledge synthesis is fascinating, especially in big organisations.

Moving away from fragmented documents to a set of facts from which an LLM synthesizes documents tailored for the reader.

There are a few tricks that would be interesting to get working.

For instance, the agent keeps evaluating itself against a set of questions. Or users add questions to see if the agent is able to understand the nuances of the topic, and so whether it can be trusted.

(Not dissimilar to what would be regression testing in classical software engineering)

Then there are the "homework" sections, where we ask human experts to check that the facts stored by the agent are still relevant and up to date.

All of these can then be enhanced with actions usable by the agent.

Think about fetching the PoC for a particular piece of software; say it is the employee Foo.

If we write this down in a document, it will definitely get outdated when Foo moves on or gets promoted.

If we put it inside a knowledge synthesis system, the system itself can ask Foo every 6 months whether they are still the PoC for the software project.

Or it could talk to the LDAP system daily and re-ask the question as soon as it notices that Foo's position or reporting structure has changed.

This can be expanded to processes to follow, reports to create, etc.
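
A toy model of the refresh loop, to make the idea concrete (the types are my own illustration; a real system would query LDAP instead of printing):

    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class Fact:
        statement: str   # e.g. "Foo is the PoC for project X"
        owner: str       # who to ask when the fact may be stale
        verified_on: date
        ttl: timedelta = timedelta(days=180)  # re-verify every 6 months

        def is_stale(self, today: date) -> bool:
            return today - self.verified_on > self.ttl

    def refresh(facts: list[Fact], today: date) -> None:
        for fact in facts:
            if fact.is_stale(today):
                # Real system: ping the owner, or cross-check the directory
                # (LDAP) for a change in role or reporting structure.
                print(f"Ask {fact.owner}: is this still true? {fact.statement!r}")

    refresh([Fact("Foo is the PoC for project X", "foo@example.com",
                  date(2024, 1, 15))],
            date.today())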

DylanDmitri 2 days ago

Seems a promising approach. The feedback form at the bottom is (?) missing a submit button. The article was fine, but veered into the overly verbose with redundant sections. A simplification pass, even on the outline, could help.

WillAdams a day ago

I keep getting:

>Sorry, STORM cannot follow arbitrary instruction. Please input a topic you want to learn about. (Our input filtering uses OpenAI GPT-3.5, which may result in false positives. We apologize for any inconvenience.)

  • Nydhal a day ago

    You have to input the title of the article you want, not instructions like "write me ..."

    • WillAdams a day ago

      I did. Eventually I managed to get an article, but by then it was generic enough to not be particularly useful.

anotheraccount9 a day ago

A very interesting project. BTW, I could not find a way to delete my account once created. I've also found that the generated report is very generic and quickly drifts outside the actual question or the specific theme/keywords used.

A final point: the notice states that "The risks associated with this study are minimal. Study data will be stored securely, in compliance with Stanford University standards, minimizing the risk of confidentiality breach." When I use STORM, I can see other people's requests. Are those supposed to be confidential?

OutOfHere 2 days ago

STORM motivated me to independently create my own similar project, https://github.com/impredicative/newssurvey, which works very differently to write a survey article on a medical or science topic. Its generated samples are linked in the readme.

sitkack 2 days ago

I like this, but the quality is lower and the output more voluminous than Phind or Perplexity.

But I like the direction of the research. I'd like to be able to specify the output reduction prompts and to tweak the evaluation agents.

This is "just" multi-agent summarization and synthesis. Most summarizers are already doing this.

The nice thing is that this is open source: https://github.com/stanford-oval/storm

[1] https://www.phind.com/search?cache=z3qe9c0z6yb0x1hqbq64mrci

[2] https://www.perplexity.ai/search/please-summarze-and-explain...

firejake308 a day ago

I feel like this is the opposite of what LLMs are useful for. I like using LLMs to summarize and get an immediate answer to a specific question, like the AI-generated summary in Google Search. In that case, the increase in convenience outweighs the decrease in reliability. But if I wanted to read a full article about a topic, I would no longer be concerned about convenience, so I would look for a more reliable source than an LLM.

audiala a day ago

We use an approach inspired by this project to generate high-level pages about city POIs, such as this one: https://audiala.com/en/united-states/philadelphia/edgar-alla...

It's far from perfect yet, sometimes too shallow and lacking a guiding thread, but after a few iterations we believe it should offer all the information a visitor might need when planning a visit.

  • asterix_pano a day ago

    I suppose the shallowness could be improved by applying this approach recursively to each sub-topic, then synthesising the results and creating a narrative around them.

    • audiala a day ago

      Indeed, but as you increase the complexity, you increase the chance of failure and the costs as well, even if those are quite minimal compared with the time a human would have to spend doing this manually.

dvt 2 days ago

I want to build this locally; I think it would be an absolute killer product. Could also consider doing an internet "deep dive" where the agent browses for maybe 1-2 hours before sorting and collating the data. Add multi-modality for even more intelligence-gathering.

andai 2 days ago

Fascinating. Last summer, inspired by AutoGPT, I made a simple Python script that does a web search for a query and uses the results to answer the user's question. Looking at this, I'm thinking I could take the web results and ask it to reformat them in the style of Wikipedia, and I wonder how that would compare.

(I built it because ChatGPT couldn't search the web yet. When Phind launched a few weeks later, my project was basically obsolete!)

It seems the main improvement this paper has over that naive approach is the quality of the inputs, i.e. using "trusted sources" rather than random web results. (They appear to get their sources from Wikipedia itself?)

I'm not sure how much value all the other steps in the process add though.
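
For comparison, the naive version is only a handful of lines (a sketch, assuming the openai package; the model name is a placeholder and the snippets come from whatever web search you use):

    from openai import OpenAI

    def wikipedia_style_article(topic: str, snippets: list[str]) -> str:
        """One-shot: stuff the web results into a single prompt and reformat."""
        client = OpenAI()
        context = "\n\n".join(snippets)  # raw text from your web search of choice
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": (f"Rewrite the following search results as a neutral, "
                            f"Wikipedia-style article about {topic}. Do not add "
                            f"claims that are not in the material.\n\n{context}"),
            }],
        )
        return resp.choices[0].message.content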

mrkramer a day ago

Funnily enough, a few months ago I was wondering what a Wikipedia written by AI would look like. Imagine automating the writing of knowledge so humans don't have to crawl the Web, books, and papers to write knowledge articles.

thebuguy 2 days ago

For the sources, it used the crappy AI-generated websites that pop up on the first page of Google.

mikewarot 2 days ago

I kept tripping some sort of "don't give it direct instructions" filter, but it gave some interesting results. I asked it, multiple times, about my BitGrid (which it actually read about, and included!), FPGAs, LUTs, and energy usage. It kept talking about the problems with the technology, the need for specialists to program it, and environmental impacts.

I did discover a dearth of published information about just how much energy a 4x4 LUT requires per cycle, and about its idle power.

rrr_oh_man a day ago

I don't know, reminds me of GPT o1:

Lots of text, lots of headers, but extremely shallow.

kk58 a day ago

Doesn't even work. 500 error

toddmorey a day ago

I've gone from thinking AI would only further degrade our sources of news and information to thinking perhaps AI is the only thing that can help combat misinformation.

Everything is so siloed and biased now that it's hard to find any presentation of a topic from a source that has no agenda. AI to help surface, aggregate, and summarize real data, expert opinion, and analysis like this would be really powerful and much needed. Essentially on-demand Wikipedia articles held to the same editorial standards. Wikipedia isn't perfect by far, but its model has been surprisingly successful considering the challenge.

jedberg 2 days ago

Did anyone figure out how to share the article after you generate it?

  • ssalka 2 days ago

    I'm guessing this just isn't implemented yet. It feels like a very alpha-stage project: when I sign in with another account and use the URL from my previous session, it tries generating the article again but seems to hang. Also, my 2nd account is unable to view the Discover page; a 403 error in dev tools says "User has not accepted consent form".

    I would think sharing by URL should work, but it has some bugs currently.

    • jedberg 2 days ago

      Same experience. Tried sharing by URL and had the same issues you did.

  • philipkglass 2 days ago

    I think that you have to download the PDF and upload it to your own site.

    They have a "Discover" page with previously generated articles, but I think that they have some sort of manual review process to enable public access and it's not updated frequently. The newest articles there were from July. I tried copying the link for a previously generated article of mine and opening it from a private browser window but I just get sent to the main site.

  • nipponese a day ago

    The URLs for the generated articles are unique and persistent.

canadiantim 2 days ago

Black text on a dark grey background is not ideal for the main hero segment.

Also, I would love a way to try it without Google authentication.

theanonymousone a day ago

I was impressed at the beginning, but then disappointed to see hallucinations in easily verifiable historical information :(

globular-toast 2 days ago

What's the point of the "elaborate on purpose" box? It makes you fill it in but doesn't seem to affect the article, at least not that I can tell.

  • mrpf1ster 2 days ago

    Probably just metadata about the request for the researchers at Stanford

kylebenzle 2 days ago

[flagged]

  • philipkglass 2 days ago

    Did you try it? It appears to use AI for search and summary but not for a foundational knowledge base. I asked it about a niche topic and I got a very useful encyclopedia-type report that included real links to published research. This is a topic that I have previously spent a lot of time on in Google Scholar where I had to skim and reject a lot of false positives that show up with simple keyword search.

    Just like on actual-Wikipedia you should read the linked references and not completely trust the body text, but also like on actual-Wikipedia the majority of the report's text seemed aligned with the content of the linked references.

    • kylebenzle a day ago

      Yes, I did try it on two topics and it failed both. Guess that's why I was actually annoyed with it, probably.