IBM Watson™ Discovery Service Ideas

Add state to Discovery Service model training process after adding training samples

This was reported as an issue in the Python SDK: https://github.com/watson-developer-cloud/python-sdk/issues/339

The issue has the following description:

I am adding training samples to Discovery Service from our training set in batches, then evaluating query performance after each batch to collect data on how relevancy performance improves or changes as samples from our training set are added to the collection.

After adding a batch of training samples, the best way I could find to determine when the ranking model has updated is to poll the collection details API with a method like the one below. This method relies on non-obvious logic, and on the fact that training_status.successfully_trained and training_status.data_updated return an empty string when the model has never been trained or training data has never been added. (The method wrap_run_query handles timeouts and connection errors and is included just for reference.)

```python
import time

import aniso8601
import requests
import urllib3
# Import path assumes the watson_developer_cloud package current when this was written.
from watson_developer_cloud import DiscoveryV1, WatsonException


class CustomDiscovery(DiscoveryV1):
    # Note: self.options (with a VERBOSE flag) is assumed to be set elsewhere.

    def wrap_run_query(self, run_query, max_failures=10):
        """Wrap a query with error-handling/retry logic"""
        def wrapped():
            num_failures = 0
            timeout = 1
            while True:
                try:
                    return run_query()
                # The trailing Exception makes the specific entries redundant;
                # they are kept to document the failure modes seen in practice.
                except (WatsonException,  # pylint: disable=W0703
                        requests.Timeout,
                        requests.exceptions.ReadTimeout,
                        urllib3.exceptions.NewConnectionError,
                        urllib3.exceptions.ConnectionError,
                        urllib3.exceptions.ReadTimeoutError,
                        Exception
                        ) as err:
                    num_failures += 1
                    if num_failures > max_failures:
                        print("Watson API failure too many times in a row. Quitting.")
                        raise
                    error_message = str(err)
                    if "exceeded the rate limit" in error_message or "Query timed out" in error_message:
                        print("Exceeded rate limit")
                    elif "busy processing" in error_message:
                        print("Hit Update Service Limit")
                    elif 'Query failed' in error_message:
                        print("Hit Query Failed")
                    elif "ConnectTimeoutError" in error_message:
                        print("Hit Connect Timeout")
                    elif "Max retries exceeded" in error_message:
                        print("Failure when connecting")
                    else:
                        print("HIT UNKNOWN EXCEPTION: ", error_message)
                    if self.options.VERBOSE:
                        print(err)
                    print("Number of Failures: {0} Will retry in {1} seconds".format(num_failures, timeout))
                    time.sleep(timeout)
                    timeout *= 1.5  # back off before the next attempt
        return wrapped

    def poll_collection(self, environment_id: str, collection_id: str):
        """poll collection details until finished processing documents/training data"""
        def run_query():
            while True:
                details = self.get_collection(environment_id=environment_id, collection_id=collection_id)
                document_counts = details["document_counts"]
                training = details["training_status"]

                # returns empty string if never trained
                current_model_date = training["successfully_trained"]
                # returns empty string if training data never added
                data_update_date = training["data_updated"]
                if current_model_date:
                    current_model_date = aniso8601.parse_datetime(current_model_date)
                if data_update_date:
                    data_update_date = aniso8601.parse_datetime(data_update_date)

                if document_counts["processing"] > 0:
                    print("Document updates still processing. {0} documents in processing queue".format(document_counts["processing"]))
                    time.sleep(2)
                elif training["processing"]:
                    print("Training updates still processing. Total number of Samples: {0}".format(training["total_examples"]))
                    if self.options.VERBOSE:
                        print("Collection Details: ", details)
                    time.sleep(4)
                # Training data is newer than the model, or no model exists yet.
                elif ((current_model_date and data_update_date and current_model_date < data_update_date)
                      or (data_update_date and not current_model_date)):
                    print(
                        "Training work is needed but training has not yet entered processing state. Total number of Samples: {0}".format(training["total_examples"]))
                    if self.options.VERBOSE:
                        print("Collection Details: ", details)
                    time.sleep(4)
                else:
                    print("Number of documents available state after applying updates: {0}".format(document_counts["available"]))
                    print("Number of documents in processing state after applying updates: {0}".format(document_counts["processing"]))
                    print("Number of documents in failed state after applying updates: {0}".format(document_counts["failed"]))
                    if training:
                        print("Trained ranker available? {0}".format(training["available"]))
                        if training["available"]:
                            print("Model creation date: {0}".format(training["successfully_trained"]))

                        print("Number of training examples: {0}".format(training["total_examples"]))
                        print("Minimum Queries Added: {0}".format(training["minimum_queries_added"]))
                        print("Minimum Examples Added: {0}".format(training["minimum_examples_added"]))
                        print("Sufficient label diversity: {0}".format(training["sufficient_label_diversity"]))
                        print("Number of Notices: {0}".format(training["notices"]))
                    break
            return details
        return self.wrap_run_query(run_query)()
```

I'm observing that the training_status.processing field returned by the collection details API doesn't change from False to True until some _indeterminate_ amount of time after a sufficient set of training samples has been added. With this behavior, the client-side logic needed to evaluate the system's processing state in a stateless manner is kind of ugly. I think there should be a tri- or quad-state training_status.processing_state value: ["no_training_needed", "training_scheduled", "training_processing"] or, if an error case exists, ["no_training_needed", "training_scheduled", "training_processing", "training_error"]. This would give clients a simpler way to determine when the training process has converged after training data has been added to the collection.
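
Until such a field exists, the same three states can be approximated on the client from the fields already described. The following is only a sketch of that idea: `TrainingState`, `_parse_timestamp` and `derive_training_state` are hypothetical names, not part of the SDK or the service, and it assumes the timestamps are ISO 8601 strings with an empty string meaning "never".

```python
from datetime import datetime
from enum import Enum
from typing import Optional


class TrainingState(Enum):
    """Hypothetical client-side mirror of the proposed processing_state values."""
    NO_TRAINING_NEEDED = "no_training_needed"
    TRAINING_SCHEDULED = "training_scheduled"
    TRAINING_PROCESSING = "training_processing"


def _parse_timestamp(timestamp: str) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; an empty string is treated as 'never'."""
    if not timestamp:
        return None
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"
    return datetime.fromisoformat(timestamp)


def derive_training_state(training_status: dict) -> TrainingState:
    """Collapse the training_status fields into a single state (assumed semantics)."""
    if training_status.get("processing"):
        return TrainingState.TRAINING_PROCESSING
    trained_at = _parse_timestamp(training_status.get("successfully_trained", ""))
    updated_at = _parse_timestamp(training_status.get("data_updated", ""))
    # Training data exists and is newer than the model, or no model exists yet.
    if updated_at is not None and (trained_at is None or trained_at < updated_at):
        return TrainingState.TRAINING_SCHEDULED
    return TrainingState.NO_TRAINING_NEEDED
```

A polling loop would then wait until `derive_training_state(details["training_status"])` returns `NO_TRAINING_NEEDED`. This sketch does not look at fields such as minimum_queries_added, so a collection with too few examples would report `TRAINING_SCHEDULED` indefinitely, and it cannot represent the proposed "training_error" state because no existing field is known to signal it.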

  • Guest
  • Jan 4 2018
  • Nathaniel Cohen commented
    January 9, 2018 02:41

    Hi -- I created this issue. Please feel free to contact me for any needed clarifications.