Hi, I just discovered this tool.
Replies: 2 comments
I've just tested it: the doc format is not supported. I think this support is really important because, even though the format is no longer in active use, the number of existing documents in this format is very large, if not the largest. Having explored the available options, here are some things that may help:

Using OpenOffice or unoconv
https://github.com/unoconv/unoconv
OpenOffice doesn't handle parallel execution very well. You can't run the conversion of millions of documents in parallel, even with unoconv, which tries to remedy this.

antiword
https://github.com/grobian/antiword

Apache Tika
Clearly the best tool, and the one I use today to extract text from the doc format. It handles this very well, works in parallel, and is written in Java. It is a simple command line to run.

Creating a markitdown plugin
However, these tools lack functionality, such as the ability to retrieve formatting. So it would be really nice if a Microsoft tool could support the historical Microsoft format. Microsoft has published the format specifications, so there is no need for reverse engineering. I don't think it would be that difficult to write a minimalist parser in Python based on the Apache Tika code.
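To give a rough idea of where such a minimalist parser could start: Word 97-2003 .doc files are stored in a Compound File Binary (OLE2) container, so a first step is usually to recognize that container from its header. The sketch below is only an illustration of that first step, not code taken from Tika or markitdown; the function name `read_cfb_header` is hypothetical, and the field offsets are my reading of the published compound file header layout.

```python
import struct
from typing import Optional

# Signature that opens every Compound File Binary (OLE2) container,
# the container used by Word 97-2003 .doc files.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> Optional[dict]:
    """Hypothetical helper: return basic header fields if `data` starts
    like a compound file, otherwise None."""
    if len(data) < 32 or data[:8] != CFB_SIGNATURE:
        return None
    # Bytes 24..31 hold four little-endian unsigned 16-bit fields:
    # minor version, major version, byte-order mark, sector shift.
    minor, major, byte_order, sector_shift = struct.unpack_from("<4H", data, 24)
    if byte_order != 0xFFFE:
        return None
    return {
        "major_version": major,
        "minor_version": minor,
        "sector_size": 1 << sector_shift,
    }


if __name__ == "__main__":
    # Synthetic header for demonstration: signature, 16 zero bytes (CLSID),
    # then minor=0x3E, major=3, byte order=0xFFFE, sector shift=9.
    sample = CFB_SIGNATURE + bytes(16) + struct.pack("<4H", 0x3E, 3, 0xFFFE, 9)
    print(read_cfb_header(sample))
    print(read_cfb_header(b"PK\x03\x04" + bytes(60)))  # a zip-based file
```

Note that this check only identifies the container. Legacy Excel and PowerPoint files use the same container, so a real parser would still have to walk the directory and find the Word-specific streams before extracting any text or formatting.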
Hi,
I finally succeeded in creating a great reader for the old legacy Word doc format.
It works like a charm (thanks to Claude, and to Microsoft for opening the standard specification).
It is written in Rust:
https://github.com/dridk/unword
I'm going to create a PR for markitdown.