IBM Watson™ Discovery Service Ideas

Remove or Strip HTML tags during ingestion

Affects both ingestion and document conversion / segmentation.

 

Stripping the markup makes the ingested content consumable in a plain-text form. The original customer use case was not only to strip the HTML but also to segment the content by HTML header level. For example, this content:

======

<h2>

My content for first document.

<p>

Is this and I really am not sure if or how to handle <b>stylistic markup</b>

</p>

</h2>

<h2>

And here is my second document.

</h2>

=======

The content above would be ingested as two JSON documents, one per <h2> section.
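
The requested behavior can be sketched outside of Discovery with the Python standard library. This is only an illustration of the idea, not how the service implements ingestion: the names HeaderSplitter and split_and_strip, and the output fields text, parent_id, and segment, are all hypothetical. The sketch starts a new document at every <h2> start tag, drops every tag (including stylistic markup such as <b>), and keeps a pointer back to the source document.

```python
import json
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect tag-free text, starting a new segment at every split tag."""

    def __init__(self, split_tag="h2"):
        super().__init__(convert_charrefs=True)
        self.split_tag = split_tag
        self.segments = []  # one list of text chunks per segment

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "H2" and "h2" both match.
        if tag == self.split_tag:
            self.segments.append([])

    def handle_data(self, data):
        if not self.segments:
            self.segments.append([])  # text that precedes the first header
        self.segments[-1].append(data)


def split_and_strip(html, split_tag="h2", source_id="source-1"):
    """Return one dict per header-delimited segment, with markup removed.

    The field names (text, parent_id, segment) are assumptions made for
    this example, not a Discovery schema.
    """
    parser = HeaderSplitter(split_tag)
    parser.feed(html)
    parser.close()
    docs = []
    for chunks in parser.segments:
        text = " ".join("".join(chunks).split())  # collapse whitespace
        if text:  # skip segments that contain only whitespace
            docs.append({"text": text, "parent_id": source_id, "segment": len(docs)})
    return docs


SAMPLE = """
<h2>
My content for first document.
<p>
Is this and I really am not sure if or how to handle <b>stylistic markup</b>
</p>
</h2>
<h2>
And here is my second document.
</h2>
"""

if __name__ == "__main__":
    print(json.dumps(split_and_strip(SAMPLE), indent=2))
```

Run against the sample content, this yields two documents. Keeping parent_id and segment on each result is one possible answer to the question of how split documents relate to their source.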

 

Follow-up investigation for this idea:

  • how to handle embedded stylistic markup
  • if the documents are split, whether some relationship should be kept between them and the source
  • Guest
  • Sep 5 2018
  • Already exists
Who would benefit from this IDEA? As a user of ingested content, I want the result text in plain-text format, without markup, so that I can present it directly to the end user.
  • Guest commented
    September 05, 2018 14:51

    I did see this reference in the documentation: https://ibm.box.com/s/kb9r2u6x9qpttjrqqxu2j3wrhnd5y1uq. It covers the part of the idea about splitting by HTML header level; however, removing the HTML within each split document is a separate matter.

  • Randy Haven commented
    September 05, 2018 15:05

    Adding another comment to register origin of this idea.

    haven@us.ibm.com (I may not have logged in properly when entering the original idea)

  • Randy Haven commented
    September 05, 2018 16:53

    Created cross reference at: https://ibm-watson.ideas.aha.io/ideas/WDS-I-201 with proper login.

  • Admin
    Phil Anderson commented
    31 Jan 16:48

    Hi Randy, doesn't this already exist? We provide options to remove specific tags and also to create a pure text field.