FedSearch - Federated network search engine

Doc Edward Morbius ⭕ · @dredmorbius

2083 followers · 14674 posts · Server toot.cat

@alcinnz So, effectively a filetype:application association manager. file(1) and magic(5) on steroids.

I am thinking of managing metadata associated with documents, works (multiple forms / manifestations of a single document), projects and workflows (involving various records, etc), and the overall document lifecycle: creation, acquisition, cataloguing, use, adaptation, distribution, destruction.

That's what I've lumped under my #webfs and #docfs concepts, along with #kfc (Krell Functional/Fucking Context).

#webfs #docfs #kfc

Last updated 3 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2083 followers · 14674 posts · Server toot.cat

@billjanssen Thanks again. Some of that looks ... closer. Cone Tree and Perspective Wall most so, though still not quite there.

Are you associated with this research/develpment, or just an interested party?

One thing I've thought about considerably as I'm increasingly using e-book readers and being frustrated by their own document management / organisational limitations, is how physical library space maps, with multiple dimensional convulutions, to stored data:

There's a mix of physical and logical organisations:

character -> word -> line -> page - > signature -> book

character -> word -> sentence -> paragraph -> chapter -> book

Shelf -> bookcase -> aisle -> floor -> building

A book (nominally: 250 pages) is about 125k words.

About 32 books fit to a shelf, 8 shelves to a bookcase, say, 16 bookcases to an aisle, 16 aisles to a floor. (I'm biasing to powers-of-two numbers here)

That's 256 books per case, 4,096 per aisle, 65,536 per floor.

(A fairly large community library is on the order of 300k books, or about 4 floors as I've defined them. A large university library, 122 such floors. Based on my experience, I may be underspecifying density, and would be interested in actual data.)

And so on.

The point I'm trying to make though isn't about density but of navigation of that space. The reader/researcher can go to a specific book, or to a shelf (closely related works), an aisle, a floor, etc. There's a different level of aggregation at each point in the scale, and for topically-organised (e.g., Library of Congress classification or Dewey Decimal), a specific region corresponds largely with a specific subject grouping.

On my e-book reader, I'm effectively limited to only one level of aggregation: a sequential shelf scan of books. With storage exceeding several TB, and an average book size of ~1--5 MB, that's effectively a fairly large community library worth of potential documents which can be carried in one's hand or satchel, but for which the organisational capabilities are ... exceedingly limited.

This remains a major frustration of mine.

#KFC #DocFS #WebFS #Libraries #DocumentManagement

#kfc #docfs #webfs #libraries #DocumentManagement

Last updated 3 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2083 followers · 14674 posts · Server toot.cat

@Researchbuzz The proximity element is limited as I am, of course, on Altair IV, some 20 of your light years away.

That said, one of my obsessions (though not necessarily a major element of my Mastodon tooting) is information, knowledge, and document management.

The tags #kfc, #webfs, and #docfs will lead to a few of my information-management / search toots / threads.

And if you've got opinions, feelings, and/or deep intel on #PaulOtlet and his #Mundaneum I'm all ears.

@woozle

#kfc #webfs #docfs #PaulOtlet #mundaneum

Last updated 3 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2082 followers · 14677 posts · Server toot.cat

Open media

I see your open-plan office AND RAISE YOU!!!

The Central Social Institution of Prague. It’s apparently still in operation.

https://www.vintag.es/2020/01/central-social-institution-prague.html

#CentralSocialInstitution #Prague #Czechia #DataStorage #InformationManagement #KFC #DocFS #WebFS

#CentralSocialInstitution #prague #czechia #datastorage #informationmanagement #kfc #docfs #webfs

Last updated 3 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2082 followers · 14677 posts · Server toot.cat

@natecull There's a strong element of the ideas that I'm playing with in #kfc which is oriented around this.

The principle interfaces would be the filesystem and shell. (The filesystem would be strongly document-oriented: #docfs & #webfs.)

And utilities would act on those documents through workflows, projects, groups, tasks, etc.

Whether or not that fits what anyone actually wants to do is ... another matter.

@TerryHancock

#kfc #docfs #webfs

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2082 followers · 14677 posts · Server toot.cat

@jonny My principles here are:

The filename should be descriptive and not simply unique.
It should be human-meaningful in some manner if at all possible.
It should scope to the collection size / namespace.

Estimates I'm aware of are that there are on the order of 100--200m books ever published, growing at ~1m year, and a generally comparable set of scientific articles. News organisations such as Reuters, AP, and AFP produce about 1k--5k items daily, and I suspect many of those are photos or videos. Major newspapers tend to produce about 100--500 stories daily (weekday vs. weekend). You can work out ballpark maths from that.

For correspondence, the originator and recipient ("From:" and "To:" are both significant. Those might be referenced. Publishing, to a general audience, is in a sence correspondence where "From:" == Author and "To:" == World.

The filename need not be precise, exact, or an accurate presentation of conents, but USEFUL. That is, within a corpus, can I find a specific work or works of interest. In this sense, the titling scheme is an example of the principle I've developed that search is identity, in the sense that a search might produce 0, 1, or n>1 results. 0 is null, 1 is identity, and > 1 is a result set.

There are other naming and cataloguing schemes. A complete system would have correspondences between these and the conventional / human-readable titles, e.g., ISBN, LOCCS, OCLC, DOI, etc.

And yes there are other cataloguing systems such as SuDoc (used by the US government) which are useful in their own contexts.

Author, date, content, audience, and publisher are generally useful search-space reducing concepts of fairly generally applicable context. E.g., if I were including, say, store receipts or purchase orders, the vendor, customer, date, location, and a summary of contents (say, largest item) a description. Computer logs tend to be time and process/service oriented, perhaps also mentioning user or network address, etc.

Related hashtags and discussion:

#docfs #webfs #KFC #PaulOtlet #Maundenaum

#docfs #webfs #kfc #PaulOtlet #maundenaum

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2082 followers · 14677 posts · Server toot.cat

@Valenoern This is the essential idea behind "docfs", which would be a document-oriented filesystem. Its networked sibling being "webfs".

"Document" here is in the sense of #PaulOtlet, of any durable record. That might be a text, image, sound, video, multimedia content, data, software, or an amalgamation or melange.

One of my key ideas is that the metadata for these documents would be part of the filesystem, extending the notion of what constitutes file-centric data. I'd like to see some form of bibliographic data presented, where available for public and published media (book, articles, audio recordings, films).

Search is another element, and one idea for the filesystem would be as a virtual filesystem in which attributes could be supplied until a single item matching those criteria was found. "Identity is search".

For projects, some concept of structured workflows, with groups, tasks, milestones, and contributing data. For a sufficiently structured organisation, security and access controls.

I'd like the whole concept to be as commercialisation-hostile as possible, with both copyrights and payments entirely out of scope.

#docfs #webfs #kfc #maundenaum #DublinCore #metadata #bibliography #Plan9OS #Schopenhauer

#PaulOtlet #docfs #webfs #kfc #maundenaum #dublincore #metadata #bibliography #plan9os #schopenhauer

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2082 followers · 14677 posts · Server toot.cat

@CyberpunkLibrarian I'd very much like that.

I've been half-assedly kicking around an idea to build such a thing, generally referred to as #KFC (Krell Functional Context / Krell Fucking Context, variously). See also #WebFS and #DocFS which relate: accessing the Web as a filesystem (see Plan9OS) and a documents-oriented filesystem in which "paths" are actually "search queries" through various spaces (author, title, pubdates, subjects / keywords, publishers, identifiers ISBN/OCLC/LOCCN/DOI, etc).

The results of any path specification are strictly one of:

No results (a failed search).
One result (an identity search, at least at the time performed).
Multiple results (a set). Which might be variously small or large (I'm thinking of some vaguely logrithmic scale for classifying this.)

I'd also like to see workflow included, some sense of a cataloguing workflow (desired, aquired, classified, converted (to some minimally-sufficient complexity best format, which is to say, LaTeX 😺 ) privacy scopes and controls, and relations between works (citations, references, translations, authors, concepts, projects, ...)

Mind, this is all but entirely vapourware.

@FiXato

#kfc #webfs #docfs

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2081 followers · 14668 posts · Server toot.cat

@thornAvery My own approaches are:

Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.
Try pdftotext, part of Poppler utils: https://poppler.freedesktop.org/ This is available for most Linux distros, MacOS under Homebrew, or check out via Git.

If I can get something vaguely reasonable, that's usually sufficient.

OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tendous correcting that retyping is frequently preferable. That said, I operate at fairly low scale.
Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.

PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.

#PDF #PDFConversion #kfc #docfs #webfs

#pdf #pdfConversion #kfc #docfs #webfs

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2081 followers · 14668 posts · Server toot.cat

@thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.

It turns out to be really complicated.

That said, maybe tell us what it is you're trying to do, specifically:

How many documents.
How large.
What languages / charactersets.
What budget (if any).
What end-use.

#webfs #docfs #kfc #PDFConversion #pdf

#webfs #docfs #kfc #pdfConversion #pdf

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2081 followers · 14668 posts · Server toot.cat

@thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.

Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.

If the PDF is actually rendered from a text source, pdftotext is pretty good at extracting the actual text.

If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.

#pdf #PDFConversion #kfc #docfs #webfs

1/

#pdf #pdfConversion #kfc #docfs #webfs

Last updated 4 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

Adelaide Hasse - Wikipedia

The US Federal Government probably produces more documents than any other entity on Earth.

Adelaide Hasse (1868--1953) is the public-schooled, self-taught OG BAMF who created the indexing and classification system which still organises that to this day, the Superintendent of Documents Classification System (SuDoc).

https://en.wikipedia.org/wiki/Adelaide_Hasse

#AdelaideHasse #SuDoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries

#AdelaideHasse #sudoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries

Last updated 5 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

@mdhughes The metadata problem is one that I've been working at (very slowly) for some years now.

Go through my #KFC #docfs and #webfs tags for some context (it's all very loose). But a goal would be to extend filesystem metadata probably along the lines of Dublin Core Metadata, as well as Some Other Stuff.

It's ... complicated.

I don't see the application being one that's suited filesystem-wide (though I could be wrong on that). A documents-archive-specific filesystem (or extension / overlay) would be very much relevant.

@wftl

#kfc #docfs #webfs

Last updated 5 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

@brennen It sounds to me as if what you need is a state-driven browser environment, rather than a link-driven one.

That's ... kind of the central notion behind #webfs which started with the question "what if the Web were filesystem accessible?"

In your case, you'd want a sort-of-project environment, in which you could launch various web pages / apps specifically, but record which you had open, on what day, associated with what project. Also some notion of whether or not there was state-in-process (e.g., Gerrit / Phabricator, Chat), or just reference (manpages, Google docs, presuming read-only).

It is possible to launch Firefox/Chrome pages from the commandline. A bash/zsh project-aware context (and specific shell history, oh, bash now has timestamps) might buy you something.

Additional tabs you open from the browser ... might not be preserved.

Oh: Firefox also has specific profiles. So you could have a project-linked profile. (Though you'd need to, say, install and update extensions individually for each.)

Noodling.

Does any of this sound vaguely useful?

#webfs

Last updated 5 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

Lyle Lovett: Here I Am

@brennen I've been noodling at a concept under the rubric of #webfs / #docfs / #kfc for a few years, which is ultimately strongly influence by Paul Otlet, Ted Nelson, Vannevar Bush, etc. Which is to say, far smarter people than me have tried and failed. But where would you be if you didn't try, as Lyell Lovett said.
https://invidious.snopyta.org/watch?v=KvDPezXTzlI

I think any system is going to have to have a review-and-cull stage, and you'll have to schedule (time-budget) for those steps. This includes the calendrical bookmarks model.

The heart of "KFC" (Krell Functional Context) is the idea that "identity" for a document is a search function, and if you can define a search that returns a document, you've identified it, at least from a strictly pragmatic view.

Metadata --- title, author, date, subject, references, citations, tags, keyword search --- are all search indicia. And the key for your system is to provide a useful way of returning to some prior work.

Discovery / use context is also a search criterian. That's what tree-style tabs offers, though it's a fragile and brittle association that cannot be saved by any mainstream browser of which I'm aware.

2/

#webfs #docfs #kfc

Last updated 5 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

reddit - Pocket: It gets worse the more you use it

@brennen Short answer: not that I've found, yet, to my satisfaction.

Some things that don't work IME:

Google Chrome, especially on Android. Anything past ~5 tabs is utterly unmanageable. My "close tabs" post-it note is dated April. Of 2015.
Pocket. See https://old.reddit.com/r/dredmorbius/comments/5x2sfx/pocket_it_gets_worse_the_more_you_use_it/
Tree-Style Tabs. A good start, but ultimately not the solution.
Zotero. I just don't think like it does. Friction is too high.
Save-as-PDF. You now have an unorganised pile elsewhere.
Zettlekasten / index cards. Useful, but too high-friction for online stuff.

1/

#TabManagement #InformationManagement #DocumentManagement #kfc #docfs #webfs

#TabManagement #informationmanagement #DocumentManagement #kfc #docfs #webfs

Last updated 5 years ago

Original post

Doc Edward Morbius ⭕ · @dredmorbius

2071 followers · 14632 posts · Server toot.cat

#DearMastomind: What different types / uses of tables can you think of?

I'm looking with a mind to document / Web formatting and styles.

Of the top of my head:

Short lists (such as this one), which are effectively single-column tables usually with far too much text crammed into a single line (such as this one).
Data tables.
Textual tables --- generally with few or no quantitative cells.
Simple (or not-so-simple) spreadsheets, offering sort, totals, and/or other summary statistics, possibly subsetting or cross-tabulation capabilities, in an interactive or at least intelligently-computed sense.
Graph-adjacent tables. Data tables which are directly related, possibly interactively linked, to some data visualisation(s).
Tabular layout. Tables used principally to organise and arrange longer bits of textual content. Need not be a classic HTML table layout or grid, though approaches this.

Different uses might have different formatting, including borders, "greenbar" separators, interactive sort or filtering capabilities (typically created now with Javascript, though native browser support might be handy), etc.

If you can think of a good discussion or reference addressing this question that would also be helpful.

#layout #tables #html #css #latex #docfs #webfs #kfc #browsers

#dearMastomind #layout #tables #html #css #latex #docfs #webfs #kfc #browsers

Last updated 5 years ago

Original post