225 points by popcar2 6 months ago | 84 comments
I've been fed up with search results so much that I decided to make a giant blocklist to remove garbage links by using uBlacklist.
I browsed other blocklists and wasn't very satisfied with what exists now; the goal of this one is to be super organized and transparent, explaining why each site was blocked via issues. Contributions welcome!
Even though only around 100 domains are blocked so far, I've already noticed a big improvement in casual searches. You'd be surprised how some AI-generated websites can dominate page 1 on DuckDuckGo.
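For anyone who keeps their own plain list of domains, here is a minimal sketch of turning it into uBlacklist rules. It assumes the match-pattern rule form `*://*.example.com/*`, which covers a domain and its subdomains; the function name and the sample domains are made up for illustration and are not taken from the blocklist repo.

```python
def to_ublacklist_rules(domains):
    """Turn bare domain names into uBlacklist match-pattern rules."""
    rules = []
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain.startswith("#"):
            continue  # skip blank lines and comment lines
        if domain in seen:
            continue  # skip duplicates (case-insensitive)
        seen.add(domain)
        # "*://*.example.com/*" matches the domain and all of its subdomains
        rules.append(f"*://*.{domain}/*")
    return rules


# Hypothetical input, not real entries from the list
example_domains = ["example-spam.test", "Example-Spam.test", "", "another-junk.example"]
rules = to_ublacklist_rules(example_domains)
print("\n".join(rules))
```

The output can be pasted into the extension's rule box or hosted as a text file for a subscription.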
cormorant 6 months ago
The problem seems worse on "alternative" search engines, e.g. DuckDuckGo and Kagi, which both use Bing. It's been driving me back to Google.
A blocklist seems like a losing proposition, unless, like adblock filter lists, it balloons to tens of thousands of entries and gets updated constantly.
Unfortunately, this kind of blocklist is highly subjective. This list blocks MSN.com! That's hardly what I would have chosen.
popcar2 6 months ago
> Unfortunately, this kind of blocklist is highly subjective. This list blocks MSN.com! That's hardly what I would have chosen.
It's definitely a bit opinionated, but it's open to discussion - you can create an unblock request issue (if you care enough to do so, of course!). The reason I blocked MSN is that it just re-hosts articles from other websites, so I'd rather see the official source than be steered to Microsoft's site, which is annoying in its own right, like how it opens another article if you scroll down too fast.
maximilianthe1 6 months ago
radicality 6 months ago
As a Kagi user I actually haven’t encountered much search result spam, surprised you’re seeing enough there to drive you back to Google!
rendaw 6 months ago
I'd block them, but there seem to be infinitely many. They're probably buying 10+ character domains using random words/names/phrases in bulk.
econ 6 months ago
BigGreenJorts 6 months ago
I'm wondering how much the blacklist can be broken down into categories of spam. Sponsorblock for YouTube has a lot of options around the types of things it'll skip through, and the user has a choice in how they're handled (skipped automatically, prompted to skip, simply highlighted in the scrubbar) at the category level.
nosioptar 6 months ago
I'm loving being able to search for something without getting results from garbage sites like howtogeek, stackoverflow, MSN, Pinterest, etc.
Llamamoe 6 months ago
Ringz 6 months ago
Another great function (not for this plugin) should be the option to "bundle" all search results from the same domain. Stuff them under one collapsible entry. I hate going through lists and pages of apple/google/synology/sonos/crab urls when I already know that I have to search somewhere else.
troyvit 6 months ago
The upside is that it would go beyond your browser to anything on your machine that makes a DNS request.
> Another great function (not for this plugin) should be the option to "bundle" all search results from the same domain. Stuff them under one collapsible entry.
That would be really cool. Just zip it up if you don't want to see that domain for that specific search.
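The "bundle by domain" idea above could be sketched roughly like this. It groups by hostname with a leading `www.` stripped, which is a simplification (it does not work out the registrable domain), and the URLs are placeholders.

```python
from urllib.parse import urlsplit


def bundle_by_domain(urls):
    """Group result URLs by hostname, keeping first-seen order."""
    groups = {}
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one
        groups.setdefault(host, []).append(url)
    return groups


# Placeholder search results
results = [
    "https://www.apple.example/a",
    "https://forum.example/t/1",
    "https://apple.example/b",
    "https://forum.example/t/2",
    "https://blog.example/post",
]
bundles = bundle_by_domain(results)
for host, urls in bundles.items():
    # One collapsible entry per host, showing how many results it hides
    print(f"{host} ({len(urls)})")
```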
LeoPanthera 6 months ago
It ironically makes me think of the Yahoo Web Directory in the 90s.
Time is a flat circle.
manx 6 months ago
OlivOnTech 6 months ago
dredmorbius 6 months ago
Power-law relations mean that a small number of domains will account for the lion's share of low-relevance results, and filtering those out will result in dramatic improvements in relevance.
That small set is probably fairly dynamic, however, and will likely change at a fairly high rate over time.
Penny-ante sites are less likely to appear in generic results, but might well be whatever the spam/phish term is for junk general Web search results.
We may well come to rely more on whitelisting, but I think at least for now that's not necessary, largely due to the dynamics of publishing / attention economies themselves.
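To make the power-law point above concrete, here is a small sketch that finds how few domains cover a given share of junk results. The counts are invented purely for illustration; they are not measurements of any search engine.

```python
def domains_covering(counts, share):
    """Return the fewest top domains whose results reach `share` of the total."""
    total = sum(counts.values())
    covered = 0
    picked = []
    # Walk domains from most to least frequent
    for domain, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= share * total:
            break
        picked.append(domain)
        covered += n
    return picked


# Invented counts of junk results per domain (illustrative only)
junk_counts = {
    "a.example": 500,
    "b.example": 250,
    "c.example": 125,
    "d.example": 60,
    "e.example": 40,
    "f.example": 25,
}
top = domains_covering(junk_counts, 0.8)
print(top)
```

With a skewed distribution like this one, blocking half of the domains removes well over 80% of the junk results, which is why a short list can help so much.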
antithesis-nl 6 months ago
Not saying you should, just that you could...
popcar2 6 months ago
antithesis-nl 6 months ago
On clicking it, uBlock blocked my visit, but that may or may not be enough for you, in which case an additional plugin may be warranted.
gtfiorentino 6 months ago
popcar2 6 months ago
You might be interested in the AI spam/low effort section though; one group that often tops DDG is these AI-generated tech articles: https://github.com/popcar2/BadWebsiteBlocklist/issues/1
They're the same site under different domains. You can tell it's AI by the writing style, how much they churn out per day, how little info there is about who's writing it, how similarly the about pages are written, and how the same article suspiciously also shows up on similar-sounding sites.
Another one I just caught today that was on top of page 1: https://github.com/popcar2/BadWebsiteBlocklist/issues/84
I'll be sure to report these sites as I'm adding to the list, thanks.
gtfiorentino 6 months ago
nashashmi 6 months ago
james-bcn 6 months ago
I may do that.
freedomben 6 months ago
Although, using this via the extension would make it cross-platform so the block affects kagi and google, which could be nice.
Although, that would require manual syncing between devices, which would not be nice.
Although, uploading it to Kagi through the API doesn't mean I have to stop using the extension, so having my cake and eating it too may be possible.
thoughtpalette 6 months ago
shortformblog 6 months ago
It’s the same reason why social media blocklists can be problematic—everyone’s calculus is different.
My suggestion is that you promote it as a starter and suggest that users fork it for their own needs.
swayvil 6 months ago
It could be simple.
Good?
shortformblog 6 months ago
manx 6 months ago
edm0nd 6 months ago
also works well with Pi-hole and other platforms.
https://github.com/spmedia/Crypto-Scam-and-Crypto-Phishing-T...
the_snooze 6 months ago
bityard 6 months ago
DuckDuckGo has site blocking. The problem is that there are so many SEO-optimized blogspam, referral link, and other "garbage" sites that you could spend a lifetime blocking each one individually before you get any actual work done. And it's only getting worse now that LLMs can generate a whole web site for you in a matter of minutes. I imagine a dedicated individual could provision several thousand websites/blogs per day, just chock full of ads and referral links.
Kuinox 6 months ago
- For example, the Kaspersky blog doesn't look bad.
- The CCleaner blog is just a list of updates.
popcar2 6 months ago
owenthejumper 6 months ago
This looks like someone's personal list not a serious effort.
popcar2 6 months ago
There are quite a few company blogs I haven't blocked, mainly ones that are actually informative and aren't trying to trick you into looking at their products.
Llamamoe 6 months ago
I considered this previously. I feel like the web would be a vastly improved experience if you just blocked everything affiliated with a corporation as opposed to a university, nonprofit, or a personal site.
Timwi 6 months ago
That is indeed something I'd want.
> This looks like someone's personal list not a serious effort.
It is the OP’s personal list and they were completely open about that.
HackerThemAll 6 months ago
This has just started. Instead of whining, contribute; the more people contribute, the more of a "serious effort" it will become.
Kuinox 6 months ago
jwx48 6 months ago
MortyWaves 6 months ago
CamperBob2 6 months ago
bluetidepro 6 months ago
nayuki 6 months ago
She talks at length about how pages of AI-generated nonsense text are cluttering search results on Google and all other search engines.
huesatbri 6 months ago
ColdTakes 6 months ago
dylan604 6 months ago
szszrk 6 months ago
It's more a matter of whom you trust. Private mode in browsers still gathers unique user IDs, and fingerprinting is widespread and fairly precise. The "logged in" part doesn't change that much.
ColdTakes 6 months ago
Night_Thastus 6 months ago
mrweasel 6 months ago
mrbluecoat 6 months ago
popcar2 6 months ago
For example: https://www.msn.com/en-us/movies/news/jodie-foster-heckled-a... is just a re-hosted version of https://www.independent.co.uk/arts-entertainment/tv/news/jod...
My hope in hiding MSN is to allow the original sources to rise back up to the top.
roskelld 6 months ago
qingcharles 6 months ago
But I have archive.is for the most part to get around that issue.
qingcharles 6 months ago
troyvit 6 months ago
lambdaone 6 months ago
The scalability comes from the caching inherent in DNS; instead of having to have millions of people downloading text files from a website over HTTP on a regular basis, the data is in effect lazy-uploaded into the cloud of caching DNS resolvers, with no administration cost on behalf of the DNSBL operator.
Reputation whitelists (or other scoring services) would also be just as easy to implement.
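A rough sketch of what a DNSBL-style lookup for a domain could look like, following the usual convention that the client queries `<domain>.<zone>` and treats any returned address as "listed". The zone `blocklist.example` is hypothetical; no such service is implied to exist.

```python
import socket


def dnsbl_query_name(domain, zone):
    """Build the name to look up: the checked domain prefixed onto the list's zone."""
    return f"{domain.strip('.').lower()}.{zone.strip('.').lower()}"


def is_listed(domain, zone):
    """True if the zone answers for this domain, False if the lookup fails.

    Note: a network or resolver failure is also treated as "not listed" here.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(domain, zone))
    except socket.gaierror:
        return False
    return True


# Hypothetical zone name, used only to show the query that would be sent
query = dnsbl_query_name("Spam-Site.example.", "blocklist.example")
print(query)
```

Because the answer is an ordinary DNS record, every caching resolver between the client and the list operator shares the load, which is the scalability argument made above.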
bityard 6 months ago
noleary 6 months ago
Some sites are complete garbage and should be blocked, of course. Others (e.g., in my experience, Quora) are sometimes quite good and sometimes quite bad. Wouldn't be my first choice, but I've found them useful at times.
For a given search, maybe you try with the most aggressive blocking / filtering. If you fail to find what you're looking for, maybe soften the restriction a bit.
Maybe this is overwrought...
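The "start strict, then soften" idea above might look something like this sketch. The tiers, the threshold, and every domain are assumptions for illustration, and hostnames are matched exactly (no subdomain handling).

```python
from urllib.parse import urlsplit


def host_of(url):
    """Hostname of a URL, or an empty string if it has none."""
    return urlsplit(url).hostname or ""


def filter_with_fallback(results, tiers, min_results=2):
    """Try each blocklist tier, most aggressive first, until enough results survive."""
    for blocked in tiers:
        kept = [url for url in results if host_of(url) not in blocked]
        if len(kept) >= min_results:
            return kept
    return results  # every tier was too strict, so show everything


# Placeholder results and tiers
results = [
    "https://quora.example/q1",
    "https://spam.example/a",
    "https://docs.example/guide",
    "https://quora.example/q2",
]
strict = {"spam.example", "quora.example"}  # blocks the "sometimes good" site too
relaxed = {"spam.example"}                  # blocks only the clear garbage
kept = filter_with_fallback(results, [strict, relaxed])
print(kept)
```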
QuadrupleA 6 months ago
SEO spam and AI slop are easily spotted on the human level. Google has hundreds of thousands of employees. Just put ONE of them on this f**ing job!
It's criminal what these companies have let happen to the web.
Llamamoe 6 months ago
Have you paused even for a moment to think about what would happen to the poor shareholders if they put that one dude on this job..?
ge96 6 months ago
I use a VM in other scenarios but even that, properly separated?
theoreticalmal 6 months ago
miyuru 6 months ago
loa_observer 6 months ago
dmix 6 months ago
swayvil 6 months ago
Do you have a forum where you discuss prospective contributions etc?
popcar2 6 months ago
swayvil 6 months ago
popcar2 6 months ago
batata_frita 6 months ago
renegat0x0 6 months ago
mediumsmart 6 months ago
lubujackson 6 months ago
Animats 6 months ago
qiine 6 months ago
purpleinfs 6 months ago
wetpaws 6 months ago
sandropuppo 6 months ago
popcar2 6 months ago
verdverm 6 months ago
fn-mote 6 months ago
If I have to piece together multiple SO answers, the issue is complex enough that I'd better actually understand it. I am not at the point where I am trusting an LLM for this.
> LLM chat can [...] answer in terms of my specific variable names
Which has value 0 for me. What are you doing that this is an asset? Generating a huge block of code? Write a function!
Edit: in fact, parent is the author of a complex configuration management tool (see profile) so getting a big block of code regurgitated with the correct variable names is probably an asset for them.
verdverm 6 months ago
I understand the concepts, it's not complex, but it's something I don't use or do daily. One of the other differences with working with LLMs over search is that I can provide a lot more input as part of my query. That context is often used within the answer, a much better experience than having to map multiple other examples onto mine.
Also, I am not the author of a complex configuration management tool. Not sure what you are misreading. I have authored a deterministic code generation tool, maybe that is what you mean? It is, however, an alternative to LLM code generation that existed prior to the current LLM hype cycle.
If you don't like LLMs, that is totally fine. You don't need to put down other people's usage or question why or how they get value out of it with feigns. Perhaps you might consider spending more time with the new knowledge tools that will not be going away. I just tried out the new Gemini Research assistant with the query...
""" google has an open source project that enables pull requests to be embedded into a git repository and also comes with a gui. Can you help me find this project? """
It took a couple of minutes and came back with exactly the project I was looking for. Saved me a bunch of time and headache trying to find this: https://github.com/google/git-appraise
I didn't have to know the exact search words or phrases, I didn't have to filter through multiple search results. I worked on this post while my LLM assistant did it for me.