IBM Watson™ Discovery Service Ideas

Remove or Strip HTML tags during ingestion

Affects both ingestion and document conversion / segmentation.


This benefits use of the ingested content by making it consumable in a more basic format. The original customer use case was not only to strip HTML but also to segment based on HTML header level. So this content:



My content for first document.


Is this and I really am not sure if or how to handle <b>stylistic markup</b>




And here is my second document.



should be ingested as two JSON documents.
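As a rough illustration of the requested behavior, here is a minimal sketch that splits HTML at a chosen header level and strips all tags, using only the Python standard library. It is not how Discovery implements ingestion. The sample markup, the `<h1>` headings with their "First" and "Second" titles, the `title` and `text` field names, and the `segment_html` helper are all assumptions made for this example, since the original markup of the sample did not survive on this page.

```python
import json
from html.parser import HTMLParser

# Assumption: these tags are treated as inline styling, so their text is
# kept in place without introducing a break between words.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "u", "span", "code"}


class HeadingSegmenter(HTMLParser):
    """Collect plain text, starting a new segment at each splitting heading."""

    def __init__(self, split_tags=("h1",)):
        super().__init__(convert_charrefs=True)
        self.split_tags = set(split_tags)
        self.segments = []
        self._in_heading = False

    def _key(self):
        return "title" if self._in_heading else "text"

    def _break(self):
        # Block-level boundaries become a space so adjacent blocks do not fuse.
        if self.segments:
            self.segments[-1][self._key()].append(" ")

    def handle_starttag(self, tag, attrs):
        if tag in self.split_tags:
            self.segments.append({"title": [], "text": []})
            self._in_heading = True
        elif tag not in INLINE_TAGS:
            self._break()

    def handle_endtag(self, tag):
        if tag in self.split_tags:
            self._in_heading = False
        elif tag not in INLINE_TAGS:
            self._break()

    def handle_data(self, data):
        if not self.segments:
            if not data.strip():
                return
            # Text before the first heading goes into an untitled segment.
            self.segments.append({"title": [], "text": []})
        self.segments[-1][self._key()].append(data)


def segment_html(html, split_tags=("h1",)):
    """Return one dict of tag-free title and text per heading-delimited segment."""
    parser = HeadingSegmenter(split_tags)
    parser.feed(html)
    parser.close()
    return [
        {key: " ".join("".join(parts).split()) for key, parts in seg.items()}
        for seg in parser.segments
    ]


# Hypothetical input: the headings are invented for illustration.
SAMPLE = (
    "<h1>First</h1>"
    "<p>My content for first document.</p>"
    "<p>Is this and I really am not sure if or how to handle "
    "<b>stylistic markup</b></p>"
    "<h1>Second</h1>"
    "<p>And here is my second document.</p>"
)

docs = segment_html(SAMPLE)
print(json.dumps(docs, indent=2))
```

In this sketch the `<b>` element is unwrapped rather than dropped, which is only one of the possible answers to the stylistic-markup question.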


Follow up investigation on this idea:

  • how to handle embedded stylistic markup
  • if the documents are split, should there be some relationship kept between them and the source
  • Guest
  • Sep 5 2018
  • Already exists
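On the follow-up question about keeping a relationship between split documents and their source, one possible shape is sketched below. The field names (`parent_document_id`, `segment_index`, `segment_count`) and both helper functions are hypothetical, chosen for illustration only, and do not describe what Discovery actually stores.

```python
def link_segments(source_id, segment_texts):
    """Attach hypothetical parent and ordering fields to each split segment."""
    total = len(segment_texts)
    return [
        {
            "id": f"{source_id}_{index}",     # assumed id scheme
            "parent_document_id": source_id,  # points back to the source
            "segment_index": index,           # position within the source
            "segment_count": total,
            "text": text,
        }
        for index, text in enumerate(segment_texts)
    ]


def siblings(documents, parent_id):
    """Return all segments of one source document, in original order."""
    matching = [d for d in documents if d["parent_document_id"] == parent_id]
    return sorted(matching, key=lambda d: d["segment_index"])


linked = link_segments("doc-123", ["First segment text", "Second segment text"])
```

With fields like these, a client could show a single segment to the end user and still reassemble or link back to the full source when needed.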
Why is it useful?
Who would benefit from this IDEA? As a user of ingested content, I want to be able to use the result text in plain-text format, without markup, so that I can present it directly to the end user.
  • Guest commented
    September 05, 2018 14:51

    I did see this reference in the documentation: which handles the portion of the idea re: splitting by HTML header level; however, the removal of the HTML within the split document is a separate matter.

  • Randy Haven commented
    September 05, 2018 15:05

    Adding another comment to register the origin of this idea. (I may not have logged in properly when entering the original idea.)

  • Randy Haven commented
    September 05, 2018 16:53

    Created cross reference at: with proper login.

  • Admin
    Phil Anderson commented
    31 Jan 16:48

    Hi Randy, doesn't this already exist?  We give options to remove specific tags and also create a pure text field.