IBM Watson™ Discovery Service Ideas

Ability to split paragraphs inside a document

Several of our documents use a structure in which a single section holds a number of unrelated paragraphs. We'd like documents to be split on paragraph marks at ingestion time.


This could be accomplished by allowing other tags (not only H1, H2, Hx...) to split a document.
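
To illustrate the request, here is a minimal sketch of splitting on a configurable set of tags, written with Python's standard `html.parser` module. It is not the Watson Discovery API: the names `TagSplitter`, `split_document` and the `split_tags` parameter are hypothetical and only show the behaviour being asked for.

```python
from html.parser import HTMLParser


class TagSplitter(HTMLParser):
    """Start a new text segment whenever a configured split tag opens."""

    def __init__(self, split_tags):
        super().__init__()
        # Hypothetical setting: which tags act as segment boundaries.
        self.split_tags = set(split_tags)
        self.segments = []
        self._buffer = []

    def _flush(self):
        # Collapse runs of whitespace and keep only non-empty segments.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.segments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.split_tags:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last boundary


def split_document(html, split_tags=("h1", "h2", "p")):
    """Return the text segments of `html`, split at each tag in `split_tags`."""
    splitter = TagSplitter(split_tags)
    splitter.feed(html)
    splitter.close()
    return splitter.segments
```

With only heading tags in `split_tags`, every paragraph under a heading stays in one segment. Adding `"p"` yields one segment per paragraph, which is the granularity requested here.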

  • Renato dos Santos Leal
  • Apr 13 2018
  • Future Consideration
Why is it useful?
Who would benefit from this IDEA? As a customer I'd like to access specific paragraphs inside a section in order to retrieve a more specific answer.
  • Percy Shi commented
    April 20, 2018 23:34

    Using predefined tags as an option to define the desired boundary of a paragraph (passage) would be very helpful for getting a self-explanatory answer from WDS.


    The current passage-level function seems to rely largely on the standard HTML tag (<p>) and on guessing the paragraph/passage boundary from the trailing \r\n and the leading space of the following text line. This approach cannot preserve context in "malformatted" documents (most technology manuals, troubleshooting guidelines, administration guidelines, etc.), where natural language and computer language are intermingled and a common paragraph-boundary format is therefore not achievable.

  • Admin
    Phil Anderson commented
    May 09, 2018 05:20

    This feature already exists:

  • Renato dos Santos Leal commented
    May 09, 2018 14:13

    Hi Phil, it does exist, but only for H1 to H6. It would be helpful to support some other HTML tags as well.

  • Deepak Sekar commented
    May 15, 2018 06:14

    We need paragraph splitting based on certain break rules, like the segmentation available in the Watson Content Analytics custom annotation pipeline or the PDF-to-paragraph converter in WEX. The limit of 250 is also a problem for clients with huge documents. In our use case, the largest document has 15,000 paragraphs.