LousyHacker · @lousy
0 followers · 5 posts · Server infosec.exchange

Recently, I've been writing a tool I call Redglass. Basically, it pipes in as much of the internet as I can find, processes and tokenizes the data, then analyzes it for named entities. A named entity is just a person, place, or thing, and parsing text for named entities is a pretty entry-level task. What makes it interesting is the way I track entities as both statistical models and nodes in a larger graph.
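(If you're curious what that pass looks like, here's a minimal sketch using spaCy, which is one common way to do it. The function name and sample sentence are mine, not Redglass internals.)

```python
# Minimal NER pass, roughly the shape of the tokenize-and-tag stage.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (surface form, entity type) pairs for one document."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

sample = "Jane Doe said Acme Corp was breached from somewhere in Berlin."
print(extract_entities(sample))
# e.g. [('Jane Doe', 'PERSON'), ('Acme Corp', 'ORG'), ('Berlin', 'GPE')]
# (exact labels depend on the model)
```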

From there, each entity which meets a few simple filtering rules can be queried to determine 1) the similarity of that entity to another entity, and 2) the nature of connections between it and other entities.
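In code terms, the two query types might look something like this. The vectors and edges are invented for the example, and I'm hand-waving how the statistical models actually get built:

```python
# Toy version of the two queries: model similarity and graph connection.
import numpy as np
import networkx as nx

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two entity feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each entity is both a vector (made up here) and a node in the graph.
models = {
    "acme_corp": np.array([0.9, 0.1, 0.4]),
    "ransomware_victim_profile": np.array([0.8, 0.2, 0.5]),
}

G = nx.Graph()
G.add_edge("acme_corp", "shady_forum")     # Acme gets discussed on the forum
G.add_edge("shady_forum", "ransom_group")  # the group recruits on the forum

# 1) How similar is this entity to another entity?
print(similarity(models["acme_corp"], models["ransomware_victim_profile"]))

# 2) What connects this entity to another entity?
print(nx.shortest_path(G, "acme_corp", "ransom_group"))
# ['acme_corp', 'shady_forum', 'ransom_group']
```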

I mostly started building this because a friend needed help with a web scraper, and it got me thinking about how to make use of large, uncurated datasets. In terms of actual use cases, I see it as a risk assessment tool ("How similar is the model which represents me to a model representing entities targeted by ransomware actors?", "How similar is it to one representing entities which have recently faced harassment or physical threats?", etc.) and as a way to do deep OSINT research without resorting to a hundred different tools and a lot of guesswork. Each source just needs a specialized collection agent, which I can usually whip up in a few minutes and which is then permanently reusable.
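For a sense of what I mean by a collection agent, here's the rough shape of one. The class and field names are illustrative, not the tool's real interface:

```python
# Hypothetical collection agent interface: every source-specific scraper
# just yields raw documents tagged with provenance, so the rest of the
# pipeline never has to care where the text came from.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class RawDocument:
    source: str   # e.g. "rss", "forum", "onion"
    url: str
    text: str

class CollectionAgent:
    def collect(self) -> Iterator[RawDocument]:
        raise NotImplementedError

class RSSAgent(CollectionAgent):
    def __init__(self, feed_url: str):
        self.feed_url = feed_url

    def collect(self) -> Iterator[RawDocument]:
        # A real agent would fetch and parse the feed here; stubbed out
        # to keep the sketch self-contained.
        yield RawDocument("rss", self.feed_url, "<fetched article text>")
```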

By examining the connections between two entities (X references Y, both are referenced by Z, etc.), you can trace the spread of news and misinformation. An interaction graph gives you a casual, high-level view of large organizations like online hate groups and cybercrime syndicates. Seeing how similar a given client's model is to one which was recently victimized can help an insurance company make underwriting decisions. In general, there are a lot of business applications I can see.
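Those reference-style queries get simple once everything lives in one directed graph. A toy version, with invented edges:

```python
# "X references Y" and "both referenced by Z" over a directed graph.
import networkx as nx

refs = nx.DiGraph()
refs.add_edges_from([
    ("news_site", "rumor_post"),  # news_site references rumor_post
    ("blog_a", "rumor_post"),
    ("blog_a", "news_site"),
])

# Who references rumor_post directly? (first hop of the story's spread)
print(list(refs.predecessors("rumor_post")))   # e.g. ['news_site', 'blog_a']

# Which nodes reference both news_site and rumor_post? (a shared amplifier)
shared = set(refs.predecessors("news_site")) & set(refs.predecessors("rumor_post"))
print(shared)                                  # {'blog_a'}
```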

On the other hand, I'm frankly quite worried about how powerfully intrusive this kind of data processing can be, and the tool is still mostly a series of hacked-together scripts. It's alarmingly easy to violate someone's privacy without ever meaning to. If you pipe in data from dark-web sources (which this tool can do), then the models you see are going to be informed by data obtained illegally or immorally. It's possible to filter that out, but very hard to do so without blanket-removing all .onion content, which defeats the point of integrating it and dramatically weakens the tool's ability to predict risks or map organizations.

Right now, I'm thinking of adding a filtering stage which removes data from the pipeline if it doesn't meet certain criteria (entity type, minimum public profile, etc.). That way the critical data can still get through, but the privacy-violating stuff gets dropped before any human can lay eyes on it.
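Something like this, where the criteria (an entity-type whitelist, a follower-count proxy for "minimum public profile") are placeholders I haven't settled on yet:

```python
# Planned filter stage, sketched: records that don't clear the bar get
# dropped before any human ever sees them. Criteria are placeholders.
ALLOWED_TYPES = {"ORG", "GPE", "PRODUCT"}   # no private individuals by default
MIN_PUBLIC_PROFILE = 1000                   # e.g. mention/follower count

def passes_filter(record: dict) -> bool:
    return (record.get("entity_type") in ALLOWED_TYPES
            and record.get("public_profile", 0) >= MIN_PUBLIC_PROFILE)

def filter_stage(records):
    """Yield only records that clear the privacy bar."""
    return (r for r in records if passes_filter(r))
```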

Anyway, holy hell this is a long post. I've been working on the tool off and on for a few months, but this is the first time I've sat down to really describe it. Maybe this will turn into something, maybe it won't. Either way, it's been fun to build, and collecting this much data from this many sources has taught me more about data science than I ever expected to know, so I call this thing a win.

What are you working on? Feel free to nerd out about it in the replies.

#osint #redglass #naturallanguageprocessing #scaling #distributed #python
