
Support for Marked Content IDs (MCID) in content streams #184

@HeimMatthias

We would like to use mupdf.js to make PDF files accessible. While mupdf.js allows us to create and update the PDF objects that provide the structure for screen readers, there is (as far as we know) currently no API to access the MCIDs of marked content in the content stream.

In the content stream, marked content is wrapped in BDC / EMC operators and identified by an /MCID entry in the property dictionary passed to BDC, e.g.

/H1 << /Lang (en-US) /MCID 100 >> BDC
(Heading Text) Tj
EMC
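
To illustrate what reading these IDs involves today, here is a minimal sketch that scans an already decoded content stream, held as a string, for BDC operators with an inline property dictionary. `extractMcids` and `MarkedContentRef` are hypothetical names and not part of mupdf.js. The sketch assumes the dictionary is written inline, contains no nested dictionaries, and that no string literal in the stream contains `>>`; property lists referenced by name through the page's /Properties resources are not resolved.

```typescript
// Hypothetical helper, NOT part of mupdf.js: a simplified text scan,
// not a real content stream parser.
interface MarkedContentRef {
  tag: string;  // structure tag without the leading slash, e.g. "H1"
  mcid: number; // value of the /MCID entry
}

function extractMcids(contentStream: string): MarkedContentRef[] {
  // "/Tag << ... >> BDC" where the dictionary body contains no ">>".
  const bdcPattern = /\/([^\s\/<>\[\]()]+)\s*<<((?:(?!>>)[\s\S])*)>>\s*BDC\b/g;
  const mcidPattern = /\/MCID\s+(\d+)/;
  const results: MarkedContentRef[] = [];

  let match: RegExpExecArray | null;
  while ((match = bdcPattern.exec(contentStream)) !== null) {
    // Only sequences that actually carry an /MCID entry are reported.
    const mcid = mcidPattern.exec(match[2]);
    if (mcid !== null) {
      results.push({ tag: match[1], mcid: Number(mcid[1]) });
    }
  }
  return results;
}

const sample = [
  "/H1 << /Lang (en-US) /MCID 100 >> BDC",
  "(Heading Text) Tj",
  "EMC",
].join("\n");

console.log(extractMcids(sample)); // [ { tag: 'H1', mcid: 100 } ]
```

A text scan like this is fragile, which is why access through a proper API would be preferable.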

Ideally, mupdf would support basic tag creation, so that the elements in an untagged PDF can be accessed via MCID numbers.
More importantly, mupdf.js should provide API access to the MCID numbers via the StructuredTextWalker API. Currently, there seems to be only very low-level API support in the PDFProcessor (https://mupdf.readthedocs.io/en/latest/reference/javascript/types/PDFProcessor.html#marked-content).

In the future, we would also welcome a high-level API to work with structured text, i.e. to create and modify the /StructTreeRoot and its entries /RoleMap, /ClassMap and /IDTree. However, these functions can already be implemented with mupdf.js, whereas the MCID tagging of the content streams remains inaccessible with the current APIs.
