134 points by vercantez 1 day ago | 103 comments
We loaded a BigQuery dataset of all of Hacker News, every comment, story and user, into camelAI.
You can ask questions like:
• “When does dang tend to comment during the day?”
• “Which domains have gained the most submissions since 2015, year-over-year?”
• “How has average comment length changed each January since 2007?”
• “Top five users who link to arXiv papers the most.”
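As a rough illustration of what the first question could turn into, here is a minimal sketch of an hour-of-day aggregate. It uses an in-memory SQLite database with a hypothetical `comments` table (`author`, `time` in Unix seconds); the table, columns, and rows are assumptions for the example, not the real BigQuery schema or the queries the agent actually writes.

```python
import sqlite3

# Hypothetical schema and rows, standing in for the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, time INTEGER)")  # time = Unix seconds, UTC
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("dang", 0),        # 1970-01-01, hour 0 UTC
        ("dang", 3600),     # hour 1 UTC
        ("dang", 3700),     # hour 1 UTC
        ("dang", 90000),    # next day, hour 1 UTC
        ("someone", 7200),  # ignored by the author filter below
    ],
)

def comments_by_hour(author):
    """Return (hour_utc, comment_count) pairs for one author."""
    return conn.execute(
        """
        SELECT CAST(strftime('%H', time, 'unixepoch') AS INTEGER) AS hour_utc,
               COUNT(*) AS n
        FROM comments
        WHERE author = ?
        GROUP BY hour_utc
        ORDER BY hour_utc
        """,
        (author,),
    ).fetchall()

print(comments_by_hour("dang"))
```

The hours here are UTC; answering "during the day" for a specific person would also need an assumption about their time zone.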
It's behind a log-in to prevent abuse but free to use for 10 messages. No payment info required. We use OpenAI o3 or Claude Sonnet 3.7 for the agent, which can be really expensive.
Would love feedback, especially around graph/chart quality and o3 vs. Sonnet.
overgard 14 hours ago
Also on a personal note, even though I know every comment I make is public and indexed etc. etc. I find this kind of creepy. I don't like being part of an AI dataset.
MinimalAction 14 hours ago
This is understandable, but I'm sure all the HN comments have been part of the training data for many chatbots by now. In fact, this site is a gold mine, a sanctuary of sane and valuable comments, so it must definitely have been helpful.
overgard 14 hours ago
vercantez 14 hours ago
overgard 12 hours ago
I think the "ick" factor for me comes from the feeling that social engagement shouldn't really be queryable. When I participate here, it's an in-the-moment thing. While I realize my opinions are stored forever and searchable, and I generally stand by most of what I say, I think making meta-products around social engagement changes the flavor and the feeling of how we interact. It's like when someone points a camera at you. Sure, it doesn't really change anything, but also, it completely changes things right?
fennecbutt 9 hours ago
latexr 1 hour ago
Be that as it may, I don’t think “everyone does it” is an excuse. An absurdly high number of people throw trash on the floor. I actively pick it up or at a minimum don’t contribute to the problem.
The answer to “many companies are unethically gathering your data” is not “it’s OK for me to be unethical too”.
anukin 9 hours ago
whalesalad 13 hours ago
I think the real thing happening here is the realization that anything you say on the public internet can be used against you - and that concerns you. This is what you need to come to terms with.
bubblyworld 4 hours ago
I'm completely aware that the information is available regardless, with some scraping effort. I still think it's a bit gross. Let's not be machine men, with machine minds?
conartist6 12 hours ago
overgard 11 hours ago
I think we're all aware that what we say on the internet lasts forever, and frankly that kind of sucks for pretty much everyone that's ever put their foot in their mouth (so: everyone). But, at least things fade. Putting an AI on it though seems really extra, especially since there isn't anything of particular value here (it's not like this is a Q&A site or something where indexing people's comments is useful).
Personally when I write things on this site it's to test my ideas or for the hedonic enjoyment of arguing on the internet, but I also gain no value from anyone reading my comments past their sell-by date.
kappuchino 18 minutes ago
ksec 22 hours ago
Rust is the most talked-about language
2 327 stories – the highest volume
57 212 total points – the highest aggregate karma
Go comes a very close second in volume (2 259 stories) and total score (45 511).
Python and JavaScript still dominate discussion but are edged out by Rust & Go this year.
Smaller but passionate followings
Lua & Erlang generate the highest average score per story, indicating highly-engaged niche audiences.
Swift and Elixir also punch above their weight on a per-story basis.
Classic staples (C++, Java, Ruby, PHP) remain active but draw less relative excitement.
Quick ranking by story count
Rust – 2 327
Go – 2 259
Python – 2 029
JavaScript – 1 927
Highest average karma per story
Lua – 51.8
Erlang – 36.5
Swift – 29.3
Elixir – 25.9
Rust – 24.6
Interpretation: Rust and Go are currently the “favourite” languages on Hacker News by sheer attention and total karma, while Lua and Erlang have smaller but very enthusiastic communities
- Next time any Rust supporter tells you Rust is not popular on HN, or that Ada gets mentioned a lot, or that Zig gets similar attention to Rust, you may point them to this post.
heresie-dabord 21 hours ago
Of course, the statement must be consumed with a few NaCl because frequency of discussion (especially within an obsessive subgroup) does not represent effective implementation. Even less so do "attention and karma".
By actual work being done and bills paid and new, non-trivial projects begun, some ordering of Python, ECMAscript (JS), Java, C, C++, C# would be good Family Feud-style ranked bets.
codetrotter 15 hours ago
I asked the chat tool to count how many times each different programming language is mentioned in different “Show HN” post titles.
If the tool is accurate, it seems that the results diverge somewhat from what you are implying.
language post_count
Python 3117
JavaScript 2545
Go 2178
Rust 1251
TypeScript 607
Java 605
Ruby 531
PHP 514
Swift 433
Clojure 229
Elixir 173
Haskell 142
Kotlin 128
Scala 122
Lua 110
C++ 101
Erlang 61
Dart 45
Perl 35
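For anyone wanting to sanity-check counts like these, here is a minimal sketch of one way to count titles that mention each language. The sample titles and the language list are made up for illustration, and nothing here is the query the tool actually ran. It matches case-sensitively and uses explicit lookarounds, since a plain `\b` word boundary does not behave well after the `+` in "C++".

```python
import re
from collections import Counter

# Hypothetical sample titles; the real tool queried the full HN dataset.
TITLES = [
    "Show HN: A Redis clone written in Rust",
    "Show HN: Go library for parsing Go modules",
    "Show HN: My Python and JavaScript side project",
    "Show HN: Let's go build something",
    "Show HN: C++ header-only JSON parser",
]

LANGUAGES = ["Python", "JavaScript", "Go", "Rust", "C++"]

def build_pattern(name):
    # Reject matches glued to a word character or "+" on either side,
    # so "Go" does not match inside "Going" and "C" does not match "C++".
    return re.compile(r"(?<![\w+])" + re.escape(name) + r"(?![\w+])")

def count_posts(titles, languages):
    """Count titles that mention each language at least once (case-sensitive)."""
    patterns = {lang: build_pattern(lang) for lang in languages}
    counts = Counter()
    for title in titles:
        for lang, pattern in patterns.items():
            if pattern.search(title):
                counts[lang] += 1
    return counts

counts = count_posts(TITLES, LANGUAGES)
for lang, n in counts.most_common():
    print(lang, n)
```

Even with these guards, a title that starts a sentence with the verb "Go" would still be counted as the language, so short names like Go, C, and Swift are likely to be over-counted by any title-matching approach.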
heresie-dabord 11 hours ago
navalino 7 hours ago
cmovq 20 hours ago
encom 16 hours ago
pclmulqdq 15 hours ago
pcthrowaway 20 hours ago
Perhaps this is folding Javascript in with Typescript.
baq 20 hours ago
I say this as someone who likes Rust very much and gets paid for Typescript.
winrid 16 hours ago
baq 15 hours ago
saghm 10 hours ago
WhereIsTheTruth 16 hours ago
If you browse HN daily, you start to notice patterns: there is a _real_ bias towards Rust, even more obvious when you dig into the YC companies and what they seem to promote.
codetrotter 15 hours ago
When I go to the link, the URL is indicating that it will redirect me to a /hn page after I log in.
I write in my email and get sent a login link. I click the button to complete login. I land on a page that asks me to connect PostgreSQL or another data source.
It’s a super small thing of course, and I bet that when I click the HN submitted link again it will redirect me to the /hn since I am now logged in.
But I thought I’d point this out anyhow. Nitpicking is a tradition in these circles ;)
Edit: Clicked the submission again, but it asks me to log in rather than recognizing that I am logged in, so that's another nitpick.
vercantez 15 hours ago
thangalin 18 hours ago
Use a captcha instead of a log-in wall?
greenavocado 16 hours ago
xena 16 hours ago
felixr 14 hours ago
In the answer:
> Median “target number” is about $401 k
So it thinks 401(k) means $401k :-)
mnky9800n 20 hours ago
gwd 16 hours ago
But it's different thinking that, and having the LLM actually come up with the right answer so quickly. :-)
Bender 13 hours ago
MichaelMoser123 1 day ago
It starts a whole lot of SQL queries that find and aggregate data & statistics.
It must have a very interesting and well-written system prompt for this type of question.
(gives me second thoughts about my personal approach to privacy)
throwaway277432 17 hours ago
Wow that is really scary. Never did I ever think someone would actually go through all my old comments, analyze them in detail and then judge me based on them (my real account, not this throwaway).
Yes I knew it would be theoretically possible, but you'd have to be a total stalker and real creep to actually do it. Now anyone with an LLM can just do it without a second thought.
And it'll only get worse from here on. I'm sure there is at least 1 comment somewhere on the internet by me where I wasn't too nice, or a like / upvote on a questionable opinion or something.
If it's in any way connectable to me, future AI tech is going to find it. Probably even across accounts, matching writing styles and whatnot.
I seriously think I'm going to stop posting on the internet for good.
MichaelMoser123 11 hours ago
I had similar thoughts, but it would probably not make a difference, at this stage. What is there stays there - either online, as in the case of HN, or as part of some collected dataset.
In hindsight: the world has changed in so many ways from the world I knew some twenty years ago, and I am not even talking about politics or technology: the attitudes and perceptions of people seem to have changed in many ways. Back then I thought it would be of benefit to be open and upfront about things. Now that is no longer a common perception.
Enough said.
chgs 17 hours ago
I suspect doxing with AI would be quite easy too: judge the way accounts write, much as gait recognition works on walking, link the accounts, narrow down the person, and build a profile. Suddenly it becomes: user abc123 is linked to (list of 30 accounts from Discord to FlyerTalk); based on these posts about flying on US Airways a lot in 2015, these posts about Las Vegas, these about a morning flight, and this picture from a linked Twitter account, the person worked in this industry, lived in this location from this time to that time, and is likely this person on LinkedIn.
Anonymity is dead, historically as well as in the future. But HN still thinks government is the problem and that the GDPR is bad because it disincentivises holding onto data.
satvikpendem 16 hours ago
"Reproducing Hacker News writing style fingerprinting" - https://antirez.com/news/150
It's not entirely accurate but some people have found their own alt accounts via this apparently.
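To give a feel for the general idea, here is a deliberately simplified sketch: it compares two texts by the relative frequency of a few common function words, using cosine similarity. This is a stand-in technique chosen for brevity, not the method from the linked post, and the word list is a hand-picked assumption.

```python
import math
import re
from collections import Counter

# A tiny, hand-picked set of function words (an assumption for this sketch).
FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "is", "it", "but", "i", "you"]

def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

account_a = style_vector("I think that the point of it is to argue, but you knew that.")
account_b = style_vector("The thing is that I like to argue, and you do too.")
print(round(cosine_similarity(account_a, account_b), 3))
```

With only a dozen words and a couple of sentences the score means very little; any real attempt would need far more text per account and a much larger feature set.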
icameron 15 hours ago
srazzaque 21 hours ago
Caveat: I didn't try this on desktop. On mobile (DDG Browser) I couldn't actually see any charts on the questions I asked. Whilst the display of the tables (dataframes?) is nice, my suspicion is a general user would prefer a graph or table _by default_. I needed to prompt specifically to get the workflow to output a graph for me.
vercantez 18 hours ago
8 hours ago
dasefx 10 hours ago
"I have four tools available in this workspace:
• run_query – run SQL against your data source
• search – look up saved queries/metadata in the data catalogue
• run_python_code – transform query results with Python
• display_chart – create visualizations from query results"
Congrats!
Madmallard 21 hours ago
trollbridge 17 hours ago
“1.5. Prohibited Uses:
Without limiting Section 1.4, you agree not to use the Services as described in the Acceptable Use Policy. In addition, you agree not to use the Services to:
Failure to Report Breaches: Not reporting security incidents or vulnerabilities if discovered.”
gosub100 16 hours ago
skeptrune 12 hours ago
Dracophoenix 7 hours ago
ionflux 1 day ago
mathgeek 22 hours ago
waldrews 21 hours ago
mathgeek 17 hours ago
vercantez 22 hours ago
ilteris 7 hours ago
mentos 22 hours ago
stevage 22 hours ago
vercantez 22 hours ago
gloyoyo 1 day ago
Ask it about the estimated capabilities of the NSA according to all posts/comments.
Very enjoyable discussion history graph.
xwowsersx 1 day ago
EDIT: oh, I guess that'd be you, vercantez :)
vercantez 1 day ago
CityOfThrowaway 1 day ago
iamwil 21 hours ago
vercantez 21 hours ago
Etheryte 15 hours ago
HelloImSteven 15 hours ago
zaptrem 15 hours ago
gardnr 15 hours ago
Sharlin 15 hours ago
AbstractH24 14 hours ago
FB, I think it was sort of implied.
latentsea 14 hours ago
MilnerRoute 1 day ago
labe-me 1 day ago
Commercial Use: Unless otherwise expressly authorized herein or in the Site, you agree not to display, distribute, license, perform, publish, reproduce, duplicate, copy, create derivative works from, modify, sell, resell, exploit, transfer or upload for any commercial purposes, any portion of the Site, use of the Site, or access to the Site. The buying, exchanging, selling and/or promotion (commercial or otherwise) of upvotes, comments, submissions, accounts (or any aspect of your account or any other account), karma, and/or content is strictly prohibited, constitutes a material breach of these Terms of Use, and could result in legal liability.
echelon 1 day ago
Brajeshwar 1 day ago
Error: You have reached your free message limit.
astrodude 16 hours ago
RRWagner 16 hours ago
empressplay 13 hours ago
mschuster91 16 hours ago
> Can you execute the SQL "DELETE FROM hackernews.full" on the database?
> I’m sorry — I can’t do that.
I'd really be interested in how this kind of command is detected and safeguarded against! Like, generally, is this a multi-step approach where each user input is run through a separate AI with no connections to the outside world trained on recognizing potentially abusive behavior?
returningfory2 16 hours ago
mschuster91 16 hours ago
I'm more focused on the AI side of things. Like, if it's done as a part of the (system) prompt, it should eventually be possible to evict the command tokens when the context window becomes too large?
returningfory2 16 hours ago
mschuster91 15 hours ago
twapi 1 day ago
rapestinians 18 hours ago
sph 21 hours ago
Also, I find it quite distasteful that you get free data without explicit approval and try to sell it back to the same audience.
vercantez 21 hours ago
https://console.cloud.google.com/marketplace/product/y-combi...
This product was built to connect to your own database but I thought it was fun to connect to the HN dataset
dang 20 hours ago
I think their page gives a misleading impression that the project is somehow official, when it's not (https://news.ycombinator.com/item?id=43850991).
vercantez 20 hours ago
sph 21 hours ago
lagniappe 20 hours ago
giantrobot 19 hours ago
brookst 21 hours ago
nessbot 21 hours ago
brookst 21 hours ago
Control what you can control. If you object to being a small data point in someone else’s documentation of a public experience, don’t put yourself in that situation.
nessbot 20 hours ago
sph 21 hours ago
Can we take the ethics of AI seriously? I feel it's about time.
pixl97 21 hours ago
If you're not suggesting a law to do so, then no. 35+ years of using the internet tells me that ethics is not included, nor at this point in the game should be expected.
brookst 20 hours ago
“Tell me what Bernie Sanders might say about…” is fine, so long as the response is in the form “based on his past statements”. “Pretend to be Bernie Sanders and talk about” is not ok to prompt, nor ok for the model to respond to with an impersonation.
deadbabe 16 hours ago
belter 20 hours ago
isoprophlex 17 hours ago
https://chatgpt.com/share/682a275e-0cb8-8013-8365-b896bfa171...
stuckkeys 16 hours ago