Python API and Unicode


3 replies to this topic

#1 handsomeransoms

handsomeransoms

  • Pip
  • Title: Member
  • Group: Members
  • 2 posts

Posted 17 June 2012 - 09:54 PM

I'm using the most recent release (1.21) of the Evernote Python API. The strings returned by the API's noteStore methods come back as plain byte strings (type str), not unicode objects, and they are treated as ASCII even though the bytes are actually UTF-8. For example:

note = noteStore.getNote(DEVELOPER_TOKEN, n_guid,
    withContent=True,
    withResourcesData=False,
    withResourcesRecognition=False,
    withResourcesAlternateData=False
)

print type(note.content) # => str
unicode( note.content, "ascii" ) # => UnicodeDecodeError
unicode( note.content, "utf-8" ) # works

This is true of other fields, such as the title field, on Notes. According to the source, these fields are all of type thrift.Thrift.TType.STRING. Looking at the Thrift source, it appears that this is meant to represent ASCII-encoded strings. There are separate types for Unicode strings (UTF7, UTF8, UTF16).

This is bad for a few reasons. First of all, some of this data returned by the API is inherently Unicode-encoded. For example, note content is written in ENML which is explicitly encoded as Unicode UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
Returning everything as ASCII leads to numerous bugs when using most libraries (for example, Jinja2). You get errors like these:

In [45]: unicode( note.content, "ascii" )
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
/Users/grobinso/Dropbox/entervals/<ipython-input-45-fd7ebf585aec> in <module>()
----> 1 unicode( note.content, "ascii" )

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1589: ordinal not in range(128)
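
The failure can be reproduced without the API at all. The following is only a minimal sketch: the byte string is a made-up stand-in for note.content (it is not real Evernote output), containing a non-breaking space encoded in UTF-8 as the two bytes 0xC2 0xA0. It uses bytes.decode so that it runs on both Python 2 and Python 3.

```python
# Hypothetical stand-in for note.content: UTF-8 bytes with a
# non-breaking space (U+00A0), which UTF-8 encodes as 0xC2 0xA0.
raw = b"<en-note>10\xc2\xa0km</en-note>"

try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    # ASCII only covers bytes 0-127, so the first byte >= 0x80 fails.
    ascii_ok = False
    bad_pos = exc.start
    failing_byte = raw[bad_pos:bad_pos + 1]

# Decoding the same bytes as UTF-8 succeeds and yields a unicode string.
text = raw.decode("utf-8")
```

The ASCII decode stops at the first byte outside the 0-127 range, which is the same kind of 0xc2 byte the traceback above complains about.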

Most annoying is the solution - I have to explicitly convert all of the string fields returned from Evernote API calls to Unicode in order to work with them. Imagine this in every piece of code that makes an API call:


# debug unicode
try:
    unicode( note.content, "ascii" )
except UnicodeError:
    note.content = unicode( note.content, "utf-8" )
else:
    # value was valid ASCII data
    pass

# or, just blanket convert everything to UTF-8
note.content = unicode( note.content, "utf-8" )
# ... likewise, for every field I work with, depending on what I'm going to do with it
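
One way to cut down on the repetition might be a small helper that decodes a list of named fields in one call. This is only a sketch under assumptions: decode_fields and FakeNote are hypothetical names made up for illustration, not part of the Evernote or Thrift libraries, and it assumes every listed field holds UTF-8 bytes.

```python
def decode_fields(obj, field_names, encoding="utf-8"):
    """Decode the named byte-string attributes of obj in place.

    Hypothetical helper, not part of the Evernote API. Attributes that
    are missing, None, or already decoded are left untouched.
    """
    for name in field_names:
        value = getattr(obj, name, None)
        if isinstance(value, bytes):  # bytes is str on Python 2
            setattr(obj, name, value.decode(encoding))
    return obj


class FakeNote(object):
    """Made-up stand-in for a Note returned by noteStore.getNote."""

    def __init__(self, title, content):
        self.title = title
        self.content = content


note = decode_fields(
    FakeNote(b"R\xc3\xa9sum\xc3\xa9", b"<en-note>\xc2\xa0</en-note>"),
    ["title", "content"],
)
```

With a real note, the call would presumably look like decode_fields(note, ["title", "content"]) right after getNote, but binary fields such as resource data should not be passed through it.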

Is there a reason why the API does not use Unicode strings? Would changing this be as easy as substituting thrift.Thrift.TType.UTF8 for thrift.Thrift.TType.STRING?

#2 handsomeransoms

handsomeransoms

  • Pip
  • Title: Member
  • Group: Members
  • 2 posts

Posted 17 June 2012 - 09:58 PM

I am not sure why the editor is subs*****uting ***** (t i t) with asterisks. Is this part of an attempt at a profanity filter?

Edit: apparently it only subs*****ues it inside of words, but not on its own. Nice.

Edited by handsomeransoms, 17 June 2012 - 09:59 PM.


#3 jefito

jefito

  • Title: Evernote Evangelist
  • Group: Evernote Evangelist
  • 8,171 posts

Posted 18 June 2012 - 01:15 AM

The bad-word filter is currently broken (set to over-aggressive); gbarry is aware and working to get it fixed.
~Jeff

#4 berryboy

berryboy

  • Pip
  • Title: Member
  • Group: Members
  • 42 posts

Posted 06 September 2012 - 02:51 PM

I'm a Python noob, and this post helped me solve exactly this problem.

Thank you.



