Hacker Remix

Show HN: Data Formulator – AI-powered data visualization from Microsoft Research

200 points by chenglong-hn 1 day ago | 34 comments

Creating data visualizations with AI nowadays often means chat, chat, and more chat... Writing long prompts is tedious, and prompts alone are not the most effective way to describe your visualization designs.

Data Formulator blends UI interaction with natural language so that you can create visualizations with AI much more effectively!

You can:

* create rich visualizations beyond your initial dataset, with AI helping to transform and visualize the data along the way

* iterate on your designs and dive deeper using data threads, a new way to manage your conversation with the AI.

Here is a demo video: https://github.com/microsoft/data-formulator/releases/tag/0....

Give it a shot and let us know what you think!

d_watt 13 hours ago

Some of the metaphors for interacting with the models, and visualizing conversations as threads, are interesting. It strikes a good combination of ease of prompting and interrogatability of the generated code.

I quickly ran into a wall trying to do interesting things like "forecast a dataset using ARIMA." On the surface it just does a linear prediction, seemingly ignoring me, but under the hood you can see that the model tried importing a library not actually in my environment, failed, and fell back to linear.
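That failure mode — a missing library silently triggering a linear fallback — can be sketched roughly like this (a hypothetical helper for illustration, not Data Formulator's actual code; the statsmodels call is the standard `ARIMA` API):

```python
def forecast(series, steps=3):
    """Forecast with ARIMA when statsmodels is available,
    otherwise fall back to a plain least-squares linear trend."""
    try:
        # may be absent from the user's environment
        from statsmodels.tsa.arima.model import ARIMA
        fitted = ARIMA(series, order=(1, 1, 1)).fit()
        return list(fitted.forecast(steps=steps))
    except ImportError:
        # closed-form least-squares line through points (i, y_i)
        n = len(series)
        xs = range(n)
        mx = sum(xs) / n
        my = sum(series) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, series))
                 / sum((x - mx) ** 2 for x in xs))
        intercept = my - slope * mx
        return [intercept + slope * (n + k) for k in range(steps)]
```

A friendlier design would surface the `ImportError` to the user (e.g., "install statsmodels to enable ARIMA") instead of degrading silently.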

Given that you're approaching this in a pythonic way, not SQL, my default way of working with it is to think about what Python stuff I'd want to do. How do you see handling these things in the future? Go the route of assuming Anaconda and prompting the model with a well-known set of libraries to expect? Or maybe prompt the user to install libraries that are missing?

chenglong-hn 13 hours ago

That's a cool example! You're right: GPT-4o is much more powerful than we allow it to be in Data Formulator, and our current design deliberately restricts it to a point where the model's behavior is more or less reliable.

While we designed it for end-user analyst scenarios (hence the much simpler UI and function support), we see big value in "freeing" GPT-4o for advanced users who would like the AI to do complex stuff. A starting point could be an "interactive terminal" where the AI and the user can communicate directly about these out-of-the-box concepts, perhaps even letting the user instruct the AI to dynamically generate new UI to adapt to their workflow.

paddy_m 16 hours ago

I have a tool [1] that is tackling some of the same problems in a different way.

I had some core views that shaped what I built.

1. When doing data manipulation, especially initial exploration and cleaning, we type the same things over and over. Being proficient with pandas involves a lot of pattern recognition, and hopefully remembering the well-written version of each pattern (like you would read in Effective Pandas).

2. pandas/polars has a huge surface area in terms of API calls, but rarely are all of those calls relevant. There are distinct operations you would want on a datetime column, a string column, or an int column. The traditional IDE paradigm is a bit lacking for this type of use (Python typing doesn't seem to utilize the dtype of a column, so you see 400 methods for every column).

3. It is less important for a tool to have the right answer out of the box than to let you cycle through different views and transforms quickly.
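Point 2 can be sketched as a small type-dispatch table: only offer the operations that make sense for a column's inferred type. This is a pure-Python illustration with hypothetical operation names, not Buckaroo's actual API:

```python
def ops_for(column):
    """Suggest only type-relevant operations for a column
    (given as a plain list of values)."""
    sample = next((v for v in column if v is not None), None)
    if isinstance(sample, bool):      # bool before int: bool subclasses int
        return ["any", "all", "invert"]
    if isinstance(sample, (int, float)):
        return ["sum", "mean", "clip"]
    if isinstance(sample, str):
        return ["strip", "lower", "to_datetime"]
    return []
```

A dtype-aware UI built on this idea shows three or four relevant choices per column instead of the 400-method wall an IDE autocomplete gives you.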

------

I built a low code UI for Buckaroo that has a DSL (JSON Lisp) that mostly specifies transform, column name, and other arguments. These operations are then applied to a dataframe, and separately the python code is generated from templates for each command.
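A toy version of that idea — an op list in a `[op, column, *args]` spirit, interpreted directly against a dict-of-lists table and, separately, rendered to pandas code from templates. This is illustrative only; Buckaroo's actual DSL and templates differ:

```python
# hypothetical program: each entry is [operation, column, *arguments]
PROGRAM = [
    ["fillna", "price", 0],
    ["dropcol", "notes"],
]

def apply_ops(table, program):
    """Execute the ops against a dict-of-lists table."""
    for op, col, *args in program:
        if op == "fillna":
            table[col] = [args[0] if v is None else v for v in table[col]]
        elif op == "dropcol":
            table.pop(col, None)
    return table

def emit_pandas(program):
    """Separately generate equivalent pandas code from per-op templates."""
    templates = {
        "fillna": "df[{col!r}] = df[{col!r}].fillna({args[0]!r})",
        "dropcol": "df = df.drop(columns=[{col!r}])",
    }
    return "\n".join(templates[op].format(col=col, args=args)
                     for op, col, *args in program)
```

The key property is that the same program drives both the live preview and the generated, human-readable code.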

I also have a facility for auto-cleaning that heuristically inspects columns and outputs the same operations. So if a column is 95% numbers and 1% blank strings, it should probably be treated as a numeric column. These operations are then visible in the low-code UI, and multiple cleaning methods can be tried out (with different thresholds).
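The "95% numbers, 1% blank strings" heuristic might look something like this (a sketch with made-up op names, not Buckaroo's implementation):

```python
def suggest_clean(column, numeric_threshold=0.95):
    """If nearly all values parse as numbers, propose coercing the
    column to numeric; blank strings fail float() and are excluded."""
    parsed = 0
    for v in column:
        try:
            float(v)
            parsed += 1
        except (TypeError, ValueError):
            pass
    if parsed / len(column) >= numeric_threshold:
        # emit the same kind of op the low-code UI displays and can edit
        return ["to_numeric", {"errors": "coerce"}]
    return None
```

Because the heuristic emits ordinary ops rather than mutating the data directly, the user can inspect, tweak, or reject each suggestion in the UI.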

[1] https://github.com/paddymul/buckaroo

[2] https://youtu.be/GPl6_9n31NE?si=YNZkpDBvov1lUYe4&t=603 Demonstrating the low code UI and autocleaning in about 3 minutes

[3] There are other related tools in this space, specifically visidata and dtale. They take different approaches which are worth learning from.

ps: I love this product space and I'm eager to talk to anyone building products in this area.

chenglong-hn 13 hours ago

this is really, really cool! directly working with the table is sometimes the only way to clean the data :)

I hope multiple ways of interacting with data can co-exist seamlessly in some future tool (without overwhelming users) :)

paddy_m 13 hours ago

To your point: LLM-based approaches have a huge adoption advantage in that you don't need to understand much to write into a text box.

A tool like Buckaroo initially requires investment in knowing where to click and how to understand the output.

zurfer 20 hours ago

Anthropic recently released something that looks more polished but follows the chat paradigm. [1]

As a builder of something similar [2], I believe the future is a mix, where you have chat (because it's easy to go deep and refine) AND generated UIs that are still configurable manually. It's interesting to see that you also use plotly for rendering charts. I've found it non-trivial to make these highly configurable via a UI (so far).
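One reason plotly suits this chat-plus-UI mix is that a figure is just a JSON-serializable dict, so LLM output and manual UI controls can edit the same object. A minimal sketch (hypothetical spec, following plotly's standard `data`/`layout` schema):

```python
# a chart spec as an LLM might emit it: plain plotly-style JSON
spec = {
    "data": [{"type": "bar", "x": ["Jan", "Feb"], "y": [3, 5]}],
    "layout": {"title": {"text": "Sales"}},
}

# a manual UI control can then mutate the very same spec,
# no round-trip through the model required
spec["layout"]["title"]["text"] = "Sales (EUR)"
```

Feeding this dict to `plotly.graph_objects.Figure(spec)` would render it; the point is that chat edits and form-field edits converge on one shared representation.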

Thank you for open sourcing so we can all learn from it.

[1] https://news.ycombinator.com/item?id=41885231 [2] https://getdot.ai

flessner 16 hours ago

The future in this space will probably stick to what IDEs have done from the beginning: Leaving the "core platform" unchanged while providing additional AI powered features around it.

Microsoft Office, VS Code, Adobe Photoshop and most other large software platforms have all embraced this.

I have genuinely not seen an AI product that works standalone (without a preexisting platform) besides chat-based LLMs.

zurfer 20 hours ago

Here is the link to one of the prompts. It seems like all the LLM tasks are in the agents directory: https://github.com/microsoft/data-formulator/blob/main/py-sr...

Some of these "agents" are used for surprising things, like sorting: https://github.com/microsoft/data-formulator/blob/main/py-sr... [this seems a bit lazy, but I guess it works :D]

chenglong-hn 12 hours ago

you found it! you can't imagine how often I'm annoyed when I see April ranked before March due to alphabetical order...

Hence the sorting agent, now running by default in the background!
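For the month-name case specifically, a deterministic fallback needs no LLM at all — rank by position in the stdlib `calendar` tables (an alternative sketch, not the project's sorting agent, which handles arbitrary semantic orderings):

```python
import calendar

# calendar.month_name is ['', 'January', ..., 'December'];
# skip the empty slot and map each name to its 1-based position
MONTH_ORDER = {m: i for i, m in enumerate(calendar.month_name) if m}

months = ["April", "January", "March"]
months.sort(key=MONTH_ORDER.get)
# → ['January', 'March', 'April']
```

An LLM-backed sorter earns its keep on orderings with no stdlib table (e.g., "S, M, L, XL" or fiscal quarters), where this lookup approach doesn't apply.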

DeathArrow 19 hours ago

If you look in the video from OP, you can see that chat is still used at some point.

chenglong-hn 12 hours ago

yes! chat is still a necessary component, since it's sometimes the only way for us to communicate unstructured information to the system.

zurfer 18 hours ago

hmm, there is a follow-up to show the difference in percent instead of absolute values, which is similar to the kind of interaction you can have in chat, and there is a sort of history on the left side, so things are chat-like to some degree.

goose- 23 hours ago

Since Data Formulator performs data transformation on your behalf to get the desired visualization, how can we verify those transformations are not contaminated by LLM hallucinations, and ultimately, the validity of the visualization?
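One partial answer is mechanical sanity checks: assert invariants that any correct transformation must preserve, independently of how the code was generated. A generic sketch (not a Data Formulator feature) for an aggregation that should keep the grand total unchanged:

```python
def totals_preserved(raw, transformed, value_col, tol=1e-9):
    """Cheap hallucination guard: a groupby/aggregation should not
    change the grand total of the value column."""
    return abs(sum(raw[value_col]) - sum(transformed[value_col])) <= tol

# toy data: per-row sales aggregated by region
raw = {"region": ["N", "N", "S"], "sales": [10.0, 5.0, 7.0]}
agg = {"region": ["N", "S"], "sales": [15.0, 7.0]}
assert totals_preserved(raw, agg, "sales")
```

Checks like this can't prove a transformation is right, but they catch a useful class of silently wrong outputs without requiring the user to read the generated code.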

larodi 22 hours ago

We can’t. Without the driver, this car runs on probability, and that's all. A capable operator is still needed in the loop.

DeathArrow 19 hours ago

You can see the generated code.

croes 17 hours ago

Do you think the people this is made for can grasp the code?

chenglong-hn 13 hours ago

this is a constant challenge! code is the ultimate verification tool, but not everyone gets it.

sometimes reading charts helps, sometimes looking at the data helps, and other times only the code can serve the verification purpose...