ICYMI, I ran some experiments to see if #VeraPDF’s parse status can be used to predict #PDF rendering problems, using an existing dataset of synthetic PDFs as ground truth. I also looked at how this compares against the occurrence of #JHOVE validation errors.
Details in this blog post:
https://www.bitsgalore.org/2023/06/29/verapdf-parse-status-as-a-proxy-for-rendering
Out of curiosity I ran both #JHOVE and #VeraPDF on the "Synthetic #PDF Testset for File Format Validation" by @mickylindlar et al. (link: https://www.radar-service.eu/radar/en/dataset/JtlOdwQquZWDqQdq).
Then did a quick comparison between validation errors as reported by JHOVE, and parse errors and logged warnings by VeraPDF.
Main result so far is that the majority of PDFs for which JHOVE reports validation errors also result in either a parser error or a warning in VeraPDF. Sneak peek here:
ICYMI #JHOVE 1.28 was fully released last week. #oag3
Download and info:
https://jhove.openpreservation.org/
Release notes:
https://github.com/openpreserve/jhove/blob/integration/RELEASENOTES.md
conformance levels of #JHOVE (and other validators):
(1) well-formed = meets the purely syntactic requirements (i.e., what's in the standard)
(2) valid = well-formed and meets the higher-level semantic requirements (i.e., what's in the schema)
(3) consistent = valid and internally extracted information is consistent with externally supplied information (i.e., what's in your policy)
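The three levels are nested: each one presupposes the one before it. A minimal sketch of that ordering follows; the names `Conformance` and `conformance_level` are made up for illustration and are not part of JHOVE's API.

```python
from enum import IntEnum


class Conformance(IntEnum):
    """Hypothetical labels for the nested levels described above."""
    NOT_WELL_FORMED = 0
    WELL_FORMED = 1   # meets the syntactic requirements
    VALID = 2         # well-formed + semantic requirements
    CONSISTENT = 3    # valid + agrees with externally supplied information


def conformance_level(syntax_ok: bool, semantics_ok: bool, policy_ok: bool) -> Conformance:
    """Return the highest level reached; a failed level blocks all higher ones."""
    if not syntax_ok:
        return Conformance.NOT_WELL_FORMED
    if not semantics_ok:
        return Conformance.WELL_FORMED
    if not policy_ok:
        return Conformance.VALID
    return Conformance.CONSISTENT


# A file that parses and passes the semantic checks but contradicts the
# externally supplied information stops at VALID.
level = conformance_level(syntax_ok=True, semantics_ok=True, policy_ok=False)
print(level.name)
```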
@mickylindlar i like this, and i definitely think there would be an appetite (#JHOVE users are hungry!) i fully agree that a group raising issues like this would be really productive, and like that it would be just for users. should i put some feelers out?
Hi @MediaArea ! I have two WAVE files (among 16 "regular" files) that #JHOVE identifies as PCMWAVEFORMAT and MediaInfo identifies as WAVE with DTS encoding. These come from a transferred audio CD (http://ark.bnf.fr/ark:/12148/cb435422954). Do you have any idea why? For further investigation, should we share the file? Thanks for your help!!
@marhop @bitsgalore @archivist_Liz
IFDs can be used to store a thumbnail or EXIF metadata, but unlike #JHOVE, #Exiftool seems to return information only for IFDs that contain images with significant content (though nothing prevents you from embedding a thumbnail that is just a small image with no relation to the main one!).
We use #JHOVE for such a task: we parse the XML output and count the IFDs of type "TIFF" whose "Newsubfiletype" = "0".
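The counting step could look roughly like the sketch below. It is not the script actually used here: `SAMPLE` is a hand-written, heavily simplified stand-in for JHOVE's XML output, and both its element layout and its namespace are assumptions to check against real output. The helper names are made up, and property names are matched case-insensitively because the spelling may vary.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for JHOVE XML output (assumed layout).
SAMPLE = """<jhove xmlns="http://schema.openpreservation.org/ois/xml/ns/jhove">
 <repInfo uri="example.tif"><properties><property>
  <name>IFDs</name>
  <values>
   <property><name>IFD</name><values>
    <property><name>Type</name><values><value>TIFF</value></values></property>
    <property><name>NewSubfileType</name><values><value>0</value></values></property>
   </values></property>
   <property><name>IFD</name><values>
    <property><name>Type</name><values><value>TIFF</value></values></property>
    <property><name>NewSubfileType</name>
     <values><value>reduced-resolution image of another image in this file</value></values>
    </property>
   </values></property>
   <property><name>IFD</name><values>
    <property><name>Type</name><values><value>Exif</value></values></property>
   </values></property>
  </values>
 </property></properties></repInfo>
</jhove>"""


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def children(elem, name):
    """Direct children of elem whose local tag name equals name."""
    return [c for c in elem if local(c.tag) == name]


def prop_name(prop):
    """Text of the <name> child of a <property>, or '' if absent."""
    names = children(prop, "name")
    return (names[0].text or "").strip() if names else ""


def find_props(elem, name):
    """Descendant <property> elements named `name` (case-insensitive)."""
    return [p for p in elem.iter()
            if p is not elem and local(p.tag) == "property"
            and prop_name(p).lower() == name.lower()]


def scalar_values(prop):
    """Texts of the <value> elements directly under prop's <values>."""
    return [(v.text or "").strip()
            for vs in children(prop, "values") for v in children(vs, "value")]


def count_main_images(xml_text):
    """Count IFDs of type 'TIFF' whose NewSubfileType value is '0'."""
    count = 0
    for ifd in find_props(ET.fromstring(xml_text), "IFD"):
        types = [v for p in find_props(ifd, "Type") for v in scalar_values(p)]
        subs = [v for p in find_props(ifd, "NewSubfileType") for v in scalar_values(p)]
        if "TIFF" in types and "0" in subs:
            count += 1
    return count


print(count_main_images(SAMPLE))
```

In the sample, only the first IFD counts: the second is flagged as a reduced-resolution image and the third is not of type "TIFF".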
@marhop @bitsgalore @archivist_Liz
Yup, but in this case, #JHOVE will return "reduced-resolution image of another image in this file" in its "NewSubFileType" element. If it's an image with different content, #JHOVE should return "0".
#OPFOAG Thomas Ledoux advocates for a standard #schematron edition tool to enforce institutional policies on #JHOVE, #veraPDF & #jpylyzer outputs.
#jpylyzer #veraPDF #JHOVE #schematron #OPFOAG
@mickylindlar shows how #JHOVE output is mapped to #Rosetta properties, so that files can be queried by a specific error. Wow!
#OPFOAG
Hey, everyone interested in digital preservation: #JHOVE is also available as an online version for one-off analyses...
https://openpreservation.org/tools/jhove/jhove-web-demonstrator/
#DigiPres_FR #PINFormats
(A bit ashamed to be discovering this only today, but oh well.)
#PINFormats #DigiPres_FR #JHOVE #OPFOAG
@mickylindlar
Carl: "PDF is a huge tree of objects linked one to another." Which makes interpreting errors far from intuitive!
But #veraPDF, and soon #JHOVE, should be able to associate an error with the problematic zone in the PDF.
#JHOVE tutorial: Carl Wilson reminds us that the software is extensible: it's pretty simple to plug in a module you have developed yourself for a given format.
Interested in digitisation and digital preservation? In Paris on 7 December? The Open Preservation Foundation is at the BnF. In the morning, a workshop on the #JHOVE tool (https://openpreservation.org/events/jhove-training/?q=5560) is planned, and in the afternoon the organisation is holding an assembly.
Registration is paid for non-member organisations. But it's an opportunity not to be missed to meet eminent and very open members of the community!
#PINFormats #DigiPres_FR #JHOVE