Doc support? #335

dridk · 2025-02-15T13:04:10Z

dridk
Feb 15, 2025

Hi,

I just discovered this tool.
Please tell me that you support the old doc (not docx) format.
Currently, only antiword and Apache Tika work with it.

dridk · 2025-02-15T14:40:29Z

dridk
Feb 15, 2025
Author

I've just tested: the doc format is not supported.

I think this support is really important: even though the format is no longer in use, the number of existing documents in this format is very large, if not the largest.
For example, hospitals have a lot of trouble retrieving the content of these old documents, which impacts research and care.

Having explored everything, here are some things that may help:

Using OpenOffice or unoconv

https://github.com/unoconv/unoconv
https://www.openoffice.org/fr/

OpenOffice doesn't handle parallel execution very well: you can't run the conversion of millions of documents in parallel, even with unoconv, which tries to remedy this.

antiword

https://github.com/grobian/antiword
This tool does not work on all doc files, and the code no longer seems to be maintained.

Apache Tika

https://tika.apache.org/

Clearly the best tool, and the one I use today to extract text from the doc format. It handles the format very well, works in parallel, and is written in Java. Running it takes a single command line.
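As a rough illustration of that workflow, here is a minimal Python sketch that calls the Tika command-line app once per file from a small thread pool. It assumes Java is installed and that a `tika-app.jar` has been downloaded locally. The jar path, the worker count, and the helper names `build_command`, `extract_text`, and `extract_many` are placeholders of mine, not part of Tika.

```python
from __future__ import annotations

import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: location of a locally downloaded Tika app jar; adjust to your setup.
TIKA_JAR = Path("tika-app.jar")


def build_command(doc_path: Path, jar: Path = TIKA_JAR) -> list[str]:
    """Build the Tika CLI call that prints the plain text of one document."""
    return ["java", "-jar", str(jar), "--text", str(doc_path)]


def extract_text(doc_path: Path, jar: Path = TIKA_JAR) -> str:
    """Run Tika on one file and return whatever text it wrote to stdout."""
    result = subprocess.run(build_command(doc_path, jar), capture_output=True, check=True)
    # The output encoding depends on the platform; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def extract_many(paths, workers: int = 4, jar: Path = TIKA_JAR) -> dict[Path, str]:
    """Extract several documents concurrently, one Java process per file."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(lambda p: extract_text(p, jar), paths))
    return dict(zip(paths, texts))
```

Something like `extract_many(Path("archive").glob("*.doc"))` would then return a mapping from each file to its text. Starting a JVM per file is simple but slow, so for millions of documents a long-running Tika process is probably the better fit.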

Creating a markitdown plugin

However, these tools lack functionality, such as the ability to retrieve formatting. So it would be really nice if a Microsoft tool could support the historical Microsoft format.

Microsoft has published the format specification here, so there is no need for reverse engineering:
https://learn.microsoft.com/en-us/openspecs/office_file_formats/ms-doc/ccd7b486-7881-484c-a137-51170af7cc22#published-version

I don't think it would be that difficult to write a minimalist parser in Python based on the Apache Tika code:
https://github.com/apache/tika/blob/2c9b6ab48a169f850bdfe2f4676afa6af2c1c540/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
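To give an idea of where such a parser would start: a legacy doc file is stored in an OLE2 compound file container, whereas docx is a ZIP archive, so the very first step is to recognize the container and read its header. The sketch below does only that, using the standard library. The function names are mine, and it is not derived from the Tika code. It also does not prove a file is a Word document, since xls and ppt use the same container; that would require locating the `WordDocument` stream described in the specification.

```python
import struct

# Signature at the start of every OLE2 compound file (legacy doc, xls, ppt, ...).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a ZIP archive (docx and others).
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(data: bytes) -> str:
    """Guess the container type from the first bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def read_cfb_header(data: bytes) -> dict:
    """Read a few fixed-offset fields from the compound file header."""
    if len(data) < 34 or not data.startswith(OLE2_MAGIC):
        raise ValueError("not an OLE2 compound file")
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift, mini sector shift (all little-endian uint16).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from("<HHHHH", data, 24)
    return {
        "major_version": major,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
    }
```

In practice one would read the first 512 bytes of the file and pass them to `sniff_container` before attempting anything else. Walking the sector chains and decoding the text itself is the larger part of the work and is left out here.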

0 replies

dridk · 2026-04-01T20:37:12Z

dridk
Apr 1, 2026
Author

Hi,

I finally succeeded in creating a great reader for the old legacy Word doc format.
It works like a charm (thanks to Claude, and to Microsoft for opening the standard specification).

It is written in Rust:
https://github.com/dridk/unword

I'm going to create a PR for markitdown.

0 replies
