@alcinnz So, effectively a filetype:application association manager. file(1) and magic(5) on steroids.
I am thinking of managing metadata associated with documents, works (multiple forms / manifestations of a single document), projects and workflows (involving various records, etc), and the overall document lifecycle: creation, acquisition, cataloguing, use, adaptation, distribution, destruction.
That's what I've lumped under my #webfs and #docfs concepts, along with #kfc (Krell Functional/Fucking Context).
@billjanssen Thanks again. Some of that looks ... closer. Cone Tree and Perspective Wall most so, though still not quite there.
Are you associated with this research/develpment, or just an interested party?
One thing I've thought about considerably as I'm increasingly using e-book readers and being frustrated by their own document management / organisational limitations, is how physical library space maps, with multiple dimensional convulutions, to stored data:
There's a mix of physical and logical organisations:
character -> word -> line -> page - > signature -> book
character -> word -> sentence -> paragraph -> chapter -> book
Shelf -> bookcase -> aisle -> floor -> building
A book (nominally: 250 pages) is about 125k words.
About 32 books fit to a shelf, 8 shelves to a bookcase, say, 16 bookcases to an aisle, 16 aisles to a floor. (I'm biasing to powers-of-two numbers here)
That's 256 books per case, 4,096 per aisle, 65,536 per floor.
(A fairly large community library is on the order of 300k books, or about 4 floors as I've defined them. A large university library, 122 such floors. Based on my experience, I may be underspecifying density, and would be interested in actual data.)
And so on.
The point I'm trying to make though isn't about density but of navigation of that space. The reader/researcher can go to a specific book, or to a shelf (closely related works), an aisle, a floor, etc. There's a different level of aggregation at each point in the scale, and for topically-organised (e.g., Library of Congress classification or Dewey Decimal), a specific region corresponds largely with a specific subject grouping.
On my e-book reader, I'm effectively limited to only one level of aggregation: a sequential shelf scan of books. With storage exceeding several TB, and an average book size of ~1--5 MB, that's effectively a fairly large community library worth of potential documents which can be carried in one's hand or satchel, but for which the organisational capabilities are ... exceedingly limited.
This remains a major frustration of mine.
#kfc #docfs #webfs #libraries #DocumentManagement
@Researchbuzz The proximity element is limited as I am, of course, on Altair IV, some 20 of your light years away.
That said, one of my obsessions (though not necessarily a major element of my Mastodon tooting) is information, knowledge, and document management.
The tags #kfc, #webfs, and #docfs will lead to a few of my information-management / search toots / threads.
And if you've got opinions, feelings, and/or deep intel on #PaulOtlet and his #Mundaneum I'm all ears.
#kfc #webfs #docfs #PaulOtlet #mundaneum
I see your open-plan office AND RAISE YOU!!!
The Central Social Institution of Prague. It’s apparently still in operation.
https://www.vintag.es/2020/01/central-social-institution-prague.html
#CentralSocialInstitution #Prague #Czechia #DataStorage #InformationManagement #KFC #DocFS #WebFS
#CentralSocialInstitution #prague #czechia #datastorage #informationmanagement #kfc #docfs #webfs
@natecull There's a strong element of the ideas that I'm playing with in #kfc which is oriented around this.
The principle interfaces would be the filesystem and shell. (The filesystem would be strongly document-oriented: #docfs & #webfs.)
And utilities would act on those documents through workflows, projects, groups, tasks, etc.
Whether or not that fits what anyone actually wants to do is ... another matter.
@jonny My principles here are:
Estimates I'm aware of are that there are on the order of 100--200m books ever published, growing at ~1m year, and a generally comparable set of scientific articles. News organisations such as Reuters, AP, and AFP produce about 1k--5k items daily, and I suspect many of those are photos or videos. Major newspapers tend to produce about 100--500 stories daily (weekday vs. weekend). You can work out ballpark maths from that.
For correspondence, the originator and recipient ("From:" and "To:" are both significant. Those might be referenced. Publishing, to a general audience, is in a sence correspondence where "From:" == Author and "To:" == World.
The filename need not be precise, exact, or an accurate presentation of conents, but USEFUL. That is, within a corpus, can I find a specific work or works of interest. In this sense, the titling scheme is an example of the principle I've developed that search is identity, in the sense that a search might produce 0, 1, or n>1 results. 0 is null, 1 is identity, and > 1 is a result set.
There are other naming and cataloguing schemes. A complete system would have correspondences between these and the conventional / human-readable titles, e.g., ISBN, LOCCS, OCLC, DOI, etc.
And yes there are other cataloguing systems such as SuDoc (used by the US government) which are useful in their own contexts.
Author, date, content, audience, and publisher are generally useful search-space reducing concepts of fairly generally applicable context. E.g., if I were including, say, store receipts or purchase orders, the vendor, customer, date, location, and a summary of contents (say, largest item) a description. Computer logs tend to be time and process/service oriented, perhaps also mentioning user or network address, etc.
Related hashtags and discussion:
#docfs #webfs #kfc #PaulOtlet #maundenaum
@Valenoern This is the essential idea behind "docfs", which would be a document-oriented filesystem. Its networked sibling being "webfs".
"Document" here is in the sense of #PaulOtlet, of any durable record. That might be a text, image, sound, video, multimedia content, data, software, or an amalgamation or melange.
One of my key ideas is that the metadata for these documents would be part of the filesystem, extending the notion of what constitutes file-centric data. I'd like to see some form of bibliographic data presented, where available for public and published media (book, articles, audio recordings, films).
Search is another element, and one idea for the filesystem would be as a virtual filesystem in which attributes could be supplied until a single item matching those criteria was found. "Identity is search".
For projects, some concept of structured workflows, with groups, tasks, milestones, and contributing data. For a sufficiently structured organisation, security and access controls.
I'd like the whole concept to be as commercialisation-hostile as possible, with both copyrights and payments entirely out of scope.
#docfs #webfs #kfc #maundenaum #DublinCore #metadata #bibliography #Plan9OS #Schopenhauer
#PaulOtlet #docfs #webfs #kfc #maundenaum #dublincore #metadata #bibliography #plan9os #schopenhauer
@CyberpunkLibrarian I'd very much like that.
I've been half-assedly kicking around an idea to build such a thing, generally referred to as #KFC (Krell Functional Context / Krell Fucking Context, variously). See also #WebFS and #DocFS which relate: accessing the Web as a filesystem (see Plan9OS) and a documents-oriented filesystem in which "paths" are actually "search queries" through various spaces (author, title, pubdates, subjects / keywords, publishers, identifiers ISBN/OCLC/LOCCN/DOI, etc).
The results of any path specification are strictly one of:
I'd also like to see workflow included, some sense of a cataloguing workflow (desired, aquired, classified, converted (to some minimally-sufficient complexity best format, which is to say, LaTeX 😺 ) privacy scopes and controls, and relations between works (citations, references, translations, authors, concepts, projects, ...)
Mind, this is all but entirely vapourware.
@thornAvery My own approaches are:
Find LITERALLY ANY FORMAT OTHER THAN PDF. HTML, text, ePub, etc., if possible.
Try pdftotext
, part of Poppler utils: https://poppler.freedesktop.org/ This is available for most Linux distros, MacOS under Homebrew, or check out via Git.
If I can get something vaguely reasonable, that's usually sufficient.
OCR is an option. I've never had good luck with that, and there's such a tremendous amount of tendous correcting that retyping is frequently preferable. That said, I operate at fairly low scale.
Retype by hand. Since I'm usually reading the work, this actually turns out to be a pretty good reading method for content-retention.
PDF itself is a container around a bunch of other formats. Asking how to convert a PDF is a bit like asking how to cook a bag full of groceries. It really depends on what's in it, and what you're hoping to get.
#pdf #pdfConversion #kfc #docfs #webfs
@thornAvery I'm trying to find what I thought I remembered as an excellent HN comment discussing how to do this at scale.
It turns out to be really complicated.
That said, maybe tell us what it is you're trying to do, specifically:
#webfs #docfs #kfc #pdfConversion #pdf
@thornAvery There's no such creature that will cover all cases. You may get lucky in many instances with easier options.
Your best bet is to find another form of the document that's closer to text. For many published documents there are good odds of this.
If the PDF is actually rendered from a text source, pdftotext
is pretty good at extracting the actual text.
If it's not ... you're left with a much more challenging job. I find with rather startling frequency that simply re-typing the document from scratch is often the best option.
#pdf #PDFConversion #kfc #docfs #webfs
1/
#pdf #pdfConversion #kfc #docfs #webfs
The US Federal Government probably produces more documents than any other entity on Earth.
Adelaide Hasse (1868--1953) is the public-schooled, self-taught OG BAMF who created the indexing and classification system which still organises that to this day, the Superintendent of Documents Classification System (SuDoc).
https://en.wikipedia.org/wiki/Adelaide_Hasse
#AdelaideHasse #SuDoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries
#AdelaideHasse #sudoc #LibraryClassification #DocumentManagement #kfc #docfs #webfs #libraries
@mdhughes The metadata problem is one that I've been working at (very slowly) for some years now.
Go through my #KFC #docfs and #webfs tags for some context (it's all very loose). But a goal would be to extend filesystem metadata probably along the lines of Dublin Core Metadata, as well as Some Other Stuff.
It's ... complicated.
I don't see the application being one that's suited filesystem-wide (though I could be wrong on that). A documents-archive-specific filesystem (or extension / overlay) would be very much relevant.
@brennen I've been noodling at a concept under the rubric of #webfs / #docfs / #kfc for a few years, which is ultimately strongly influence by Paul Otlet, Ted Nelson, Vannevar Bush, etc. Which is to say, far smarter people than me have tried and failed. But where would you be if you didn't try, as Lyell Lovett said.
https://invidious.snopyta.org/watch?v=KvDPezXTzlI
I think any system is going to have to have a review-and-cull stage, and you'll have to schedule (time-budget) for those steps. This includes the calendrical bookmarks model.
The heart of "KFC" (Krell Functional Context) is the idea that "identity" for a document is a search function, and if you can define a search that returns a document, you've identified it, at least from a strictly pragmatic view.
Metadata --- title, author, date, subject, references, citations, tags, keyword search --- are all search indicia. And the key for your system is to provide a useful way of returning to some prior work.
Discovery / use context is also a search criterian. That's what tree-style tabs offers, though it's a fragile and brittle association that cannot be saved by any mainstream browser of which I'm aware.
2/
@brennen Short answer: not that I've found, yet, to my satisfaction.
Some things that don't work IME:
Google Chrome, especially on Android. Anything past ~5 tabs is utterly unmanageable. My "close tabs" post-it note is dated April. Of 2015.
Pocket. See https://old.reddit.com/r/dredmorbius/comments/5x2sfx/pocket_it_gets_worse_the_more_you_use_it/
Tree-Style Tabs. A good start, but ultimately not the solution.
Zotero. I just don't think like it does. Friction is too high.
Save-as-PDF. You now have an unorganised pile elsewhere.
Zettlekasten / index cards. Useful, but too high-friction for online stuff.
1/
#TabManagement #InformationManagement #DocumentManagement #kfc #docfs #webfs
#TabManagement #informationmanagement #DocumentManagement #kfc #docfs #webfs
tired: Our customer's paperwork is profit. Our own paperwork is loss.[1]
wired: Your proprietay data format is loss. Our proprietary data format is profit.
I'd remembered the first aphorism from a long-ago collection of Murphy's Laws.
Thinking through my struggles at organising online and digital media, references, etc., I realised that a huge problem is that these formats don't serve my goals. They're designed far more around their authors' goals, or even more often, the publishers' goals, largely around advertising, marketing, tracking, building lock-in, creating and defending monopolies, and the like.
Digital formats that are in the end-user's interest and specification serve the user. Those that are in the publisher's specification serve the publisher.
A related thought is that a key affordance of printed periodicals (newspapers, magazines, journals) is that of garbage collection, to put a contemporary spin on it.
When you're done reading a newspaper or magazine, you pick up the whole lot and throw it out. There's an intermediate level of organisation other than "the article" and "the whole collection" (that is, everything published in your office or home), "the issue". (Or perhaps a box or shelf of archived media.) That is, _there are multiple naturally-occurring levels of aggregation.)
When you're trying to sort through a set of browser tabs, you generally have only two levels of aggregation: the individual tab, or the entire session. There are typically no intermediate levels, and sorting through what you want to keep (or re-read, or work with) means you've got to go through the set one at a time and resolve disposition. The data format serves the browser vendor, but not the user.
Tools such as Tree-Style Tabs, an absolutely essential Firefox extension, give a higher level of natural organisation, the tab tree. Here, a structure emerges, without user effort, of related content. At the top of the tree is whatever page began an exploration, and as you descend it, you go further down into the search. When cleaning up, it's possible to pick any given tab, branch, or whole tree, and close it out in one fell swoop. Garbage collection costs are reduced.
(Three guesses as to what I've been attempting to do, and the first two don't count.)
#media #paperwork #DigitalMedia #DigitalFormats #FileFormats #DataFormats #kfc #docfs #UserCentricDesign #TreeStyleTabs
#media #paperwork #digitalmedia #DigitalFormats #FileFormats #dataformats #kfc #docfs #UserCentricDesign #treestyletabs
#DearMastomind: What different types / uses of tables can you think of?
I'm looking with a mind to document / Web formatting and styles.
Of the top of my head:
Different uses might have different formatting, including borders, "greenbar" separators, interactive sort or filtering capabilities (typically created now with Javascript, though native browser support might be handy), etc.
If you can think of a good discussion or reference addressing this question that would also be helpful.
#layout #tables #html #css #latex #docfs #webfs #kfc #browsers
#dearMastomind #layout #tables #html #css #latex #docfs #webfs #kfc #browsers