Thursday, February 26, 2015

"Free" can help a book do its job



(Note: I wrote this article for NZCommons, based on my presentation at the 2015 PSP Annual Conference in February.)

Every book has a job to do. For many books, that job is to make money for their creators. But a lot of books have other jobs to do. Sometimes the fact that people pay for books helps that job, but other times the book would be able to do its job better if it were free for everyone.

That's why Creative Commons licensing is so important. But while CC addresses the licensing problem nicely, free ebooks face many challenges that make it difficult for them to do their jobs.

Let's look at some examples.

When Oral Literature in Africa was first published in 1970, its number one job was to earn tenure for the author, a rising academic. It succeeded, and then some. The book became a classic, elevating an obscure topic and creating an entire field of scholarly inquiry in cultural anthropology. But in 2012, it was failing to do any job at all. The book was out of print and unavailable to young scholars on the very continent whose culture it documented. Ruth Finnegan, the author, considered it her life's work and hoped it would continue to stimulate original research and new insights. To accomplish that, the book needed to be free. It needed to be translatable, it needed to be extendable.


Nga Reanga Youth Development: Maori Styles, an Open Access book by Josie Keelan, is another example of an academic book with important jobs to do. While its primary job is a local one, the advancement of understanding and practice in Maori youth development, it has another job, a global one. Being free helps it speak to scholars and researchers around the world.

Leanne Brown's Good and Cheap is a very different book. It's a cookbook. But the job she wanted it to do made it more than your usual cookbook. She wanted to improve the lives of people who receive "nutrition assistance" (food stamps) by providing recipes for nutritious and healthy meals that can be made without spending much money. By being free, Good and Cheap helps more people in need eat well.

My last example is Casey Fiesler's Barbie™ I Can Be A Computer Engineer The Remix! Now With Less Sexism! The job of this book is to poke fun at the original Barbie™ I Can Be A Computer Engineer, in which Barbie needs boys to do the actual computer coding. But because Fiesler uses material from the original under "fair use", anything other than free, non-commercial distribution isn't legal. Barbie, remixed can ONLY be a free ebook.

But there's a problem with free ebooks. The book industry runs on a highly evolved and optimized cradle-to-grave supply chain, comprising publishers, printers, production houses, distributors, wholesalers, retailers, aggregators, libraries, publicists, developers, cataloguers, database suppliers, reviewers, used-book dealers, even pulpers. And each entity in this supply chain takes its percentage. The entire chain stops functioning when an ebook is free. Even libraries (most of them) lack the processes that would enable them to include free ebooks in their collections.

At Unglue.it, we ran smack into this problem when we set out to bring books into the creative commons. We helped Open Book Publishers crowdfund a new ebook edition of Oral Literature in Africa. The ebook was then freely available, but it wasn't easy to make it free on Amazon, which dominates the ebook market. We couldn't get the big ebook aggregators that serve libraries to add it to their platforms. We realized that someone had to do the work that the supply chain didn't want to do.

Over the past year, we've worked to turn Unglue.it into a "bookstore for free books". The transformation isn't done yet, but we've built a database of over 1200 downloadable ebooks, licensed under Creative Commons or other free licenses. We have a long way to go, but we're distributing over 10,000 ebooks per month. We're providing syndication feeds, developing relationships with distributors, improving metadata, and promoting wonderful books that happen to be free.

The creators of these books still need to find support. To help them, we've developed three revenue programs. For books that already have free licenses, we help the creators ask for financial support in the one place where readers are most appreciative of their work: inside the books themselves. We call this "thanks for ungluing".

For books that exist as ebooks but need to recoup production costs, we offer "buy-to-unglue". We'll sell these books until they reach a revenue target, after which they'll become open access. For books that exist in print but need funding for conversion to open access ebook, we offer "pledge-to-unglue", which is a way of crowd-funding the conversion.

After a book has finished its job, it can look forward to a lengthy retirement. There's no need for books to die anymore, but we can help them enjoy retirement, and maybe even enjoy a second life. Project Gutenberg has over 50,000 books that have "retired" into the public domain. We're starting to think about the care these books need. Formats change along with the people that use them, and the book industry's supply chain does its best to turn them back into money-earners to pay for that care.

Recently we received a grant from the Knight Foundation to work on ways to provide the long-term care that these books need to be productive AND free in their retirements. GITenberg, a collaboration between the folks at Unglue.it and ebook technologist Seth Woodward, is exploring the use of GitHub for free ebook maintenance. GitHub is a website that supports collaborative software development with source control and workflow tools. Our hope is that the ingredients that have made GitHub wildly successful in the open source software world will prove to be similarly effective in supporting ebooks.

It wasn't so long ago that printing costs made free ebooks impossible. So it's no wonder that free ebooks haven't realized their full potential. But with cooperation and collaboration, we can really make wonderful things happen.

Monday, February 9, 2015

"Passwords are stored in plain text."

Many states have "open records" laws which mandate public disclosure of business proposals submitted to state agencies. When a state library or university requests proposals for library systems or databases, the vendor responses can be obtained and reviewed. When I was in the library software business, it was routine to use these laws to do "competitor intelligence". These disclosures can often reveal inner workings of proprietary vendor software that have implications for information privacy and security.

Consider, for example, this request for "eResources for Minitex". Minitex is a "publicly supported network of academic, public, state government, and special libraries working cooperatively to improve library service for their users in Minnesota, North Dakota and South Dakota" and it negotiates database licenses for libraries throughout the three states.

Question number 172 in this Request for Proposals (RFP) was: "Password storage. Indicate how passwords are stored (e.g., plain text, hash, salted hash, etc.)."

To provide context for this question, you need to know just a little bit of security and cryptography.

I'll admit to having written code 15 years ago that saved passwords as plain text. This is a dangerous thing to do, because if someone were to get unauthorized access to the computer where the passwords were stored, they would have a big list of passwords. Since people tend to use the same password on multiple systems, the breached password list could be used, not only to gain access to the service that leaked the password file, but also to other services, which might include banks, stores and other sites of potential interest to thieves.

As a result, web developers are now strongly admonished never to save passwords as plain text. Doing so in a new system should be considered negligent, and could easily result in liability for the developer if the system security is breached. Unfortunately, many businesses would rather risk paying lawyers a lot of money to defend themselves should something go wrong than bite the bullet and pay some engineers a little money now to patch up the older systems.

To prevent the disclosure of passwords, the current standard practice is to "salt and hash" them.

A cryptographic hash function mixes up a password so that the password cannot be reconstructed. So, for example, the hash of 'my_password' is 'a865a7e0ddbf35fa6f6a232e0893bea4'. When a user enters their password, the hash of the password is recalculated and compared to the saved hash to determine whether the password is correct.
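The store-and-compare step just described can be sketched in a few lines of Python. This is only an illustration: it uses MD5 because that is the algorithm behind the examples in this post, not because MD5 is a good choice, and the function names `md5_hex` and `check_password` are made up for the sketch.

```python
import hashlib
import hmac


def md5_hex(password: str) -> str:
    """Return the MD5 hash of a password as 32 hex characters."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def check_password(candidate: str, stored_hash: str) -> bool:
    """Re-hash what the user typed and compare it to the saved hash."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(md5_hex(candidate), stored_hash)


# Only the hash is saved at signup; the password itself is thrown away.
stored = md5_hex("my_password")

print(stored)
print(check_password("my_password", stored))   # True
print(check_password("wrong_guess", stored))   # False
```

Note that nothing in this sketch can turn `stored` back into the password; the service can only check a guess against it.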

As a result of this strategy, the password can't be recovered. But it can be reset, and the fact that no one can recover the password eliminates a whole bunch of "social engineering" attacks on the security of the service.

Given a LOT of computer power, there are brute force attacks on the hash, but the easiest attack is to compute the hashes for the most common passwords. In a large file of passwords, you should be able to find some accounts that are breachable, even with the hashing. And so a "salt" is added to the password before the hash is applied. In the example above, a hash would be computed for 'SOME_CLEVER_SALTmy_password'. Which, of course, is '52b71cb6d37342afa3dd5b4cc9ab4846'.
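Here is a rough sketch of that common-password attack and of how a salt defeats the precomputed table. The password list, the two accounts, and the helper names are all hypothetical; the salt string is the one from the example above, and MD5 is used only to match the examples in this post.

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical list of passwords an attacker would try first.
COMMON_PASSWORDS = ["123456", "password", "qwerty", "letmein"]

# The attacker computes this table once and reuses it on every leaked file.
lookup = {md5_hex(p): p for p in COMMON_PASSWORDS}

# Hypothetical accounts; one user picked a common password.
accounts = {"alice": "letmein", "bob": "tr0ub4dor&3"}


def crack(hashed_file: dict) -> dict:
    """Return the accounts whose stored hash appears in the lookup table."""
    return {user: lookup[h] for user, h in hashed_file.items() if h in lookup}


unsalted_file = {user: md5_hex(pw) for user, pw in accounts.items()}

SALT = "SOME_CLEVER_SALT"
salted_file = {user: md5_hex(SALT + pw) for user, pw in accounts.items()}

print(crack(unsalted_file))  # {'alice': 'letmein'}
print(crack(salted_file))    # {}
```

The salted file is not uncrackable. An attacker who learns the salt can rebuild the table, but has to redo that work for each salt rather than reusing one table everywhere.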

To attack the salted password file, you'd need to know that salt. And since every application uses a different salt, each file of salted passwords is completely different. A successful attack on one hashed password file won't compromise any of the others.
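A common refinement, and the one the first vendor response below describes, is to generate a random salt for each user rather than one per application. The sketch below is my own illustration, not any vendor's code. It swaps MD5 for PBKDF2 with SHA-256 from Python's standard library, and the iteration count and function names are assumptions.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

ITERATIONS = 100_000  # assumed work factor; tune for your own hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); both are stored next to the user record."""
    salt = secrets.token_bytes(16)  # fresh random salt for every user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(digest, expected)


# Two users with the same password end up with different stored values.
salt_a, digest_a = hash_password("my_password")
salt_b, digest_b = hash_password("my_password")

print(digest_a != digest_b)                                # True
print(verify_password("my_password", salt_a, digest_a))    # True
print(verify_password("wrong_guess", salt_a, digest_a))    # False
```

With per-user salts, an attacker who steals the file has to attack each account separately, even when two users chose the same password.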

Another standard practice for user-facing password management is to never send passwords unencrypted. The best way to do this is to use HTTPS, since web browser software alerts the user that their information is secure. Otherwise, any server between the user and the destination server (there might be 20-40 of these for typical web traffic) could read and store the user's password.

The Minitex RFP covers reference databases. For this reason, only a small subset of the services offered to libraries is covered here. The authentication for these sorts of systems typically doesn't depend on the user creating a password; user accounts are used to save the results of a search, or to provide customization features. A Minitex patron can use many of the offered databases without providing any sort of password.

So here are the verbatim responses received for the Minitex RFP:

LearningExpress, LLC
Response: "All passwords are stored using a salted hash. The salt is randomly generated and unique for each user."
My comment: This is a correct answer. However, the LearningExpress login sends passwords in the clear over HTTP.

OCLC
Response: "Passwords are md5 hashed."
My comment: MD5 is the hash algorithm I used in my examples above. It's not considered very secure (see comments). OCLC Firstsearch does not force HTTPS and can send login passwords in the clear.

Credo
Response: "N/A"
My comment: This just means that no passwords are used in the service.

Infogroup Library Division
Response: "Passwords are currently stored as plain text. This may change once we develop the customization for users within ReferenceUSA. Currently the only passwords we use are for libraries to access usage stats."
My comment: The user customization now available for ReferenceUSA appears at first glance to be done correctly.

EBSCO Information Services
Response: "EBSCOhost passwords in EBSCOadmin are stored in plain text."
My comment: I should note that EBSCOadmin is not an end-user facing system. So if the EBSCO systems were compromised, only library administrator credentials would be exposed.

Encyclopaedia Britannica, Inc.
Response: "Passwords are stored as plain text."
My comment: I wonder if EB has an article on network security?

ProQuest
Response: "We store all passwords as plain text."
My comment: The ProQuest service available through my library creates passwords over HTTP but uses some client-side encryption. I have not evaluated the security of this encryption.

Scholastic Library Publishing, Inc.
Response: "Passwords are not stored. FreedomFlix offers a digital locker feature and is the only digital product that requires a login and password. The user creates the login and password. Scholastic Library Publishing, Inc does not have access to this information."
My comment: The "FreedomFlix" service not only sends user passwords unencrypted over HTTP, it sends them in a GET query string. This means that not only can anyone see the user passwords in transit, but log files will capture and save them for long-term perusal. Third-party sites will be sent the password in referrer headers. When used on a shared computer, subsequent users will easily see the passwords. "Scholastic Library Publishing" may not have access to user passwords, but everyone else will have them.

Cengage Learning
Response: "Passwords are stored in plain text."
My comment: Like FreedomFlix, the Gale Infotrac service from Cengage sends user passwords in the clear in a GET query string. But it asks the user to enter their library barcode in the password field, so users probably wouldn't be exposing their personal passwords.
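To see why a password in a GET query string is so leaky, compare what ends up in the URL for a GET login with what ends up there for a POST login over HTTPS. This sketch makes no network requests; the host `example.org`, the field names, and the credentials are invented for illustration and are not taken from any vendor's site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical login form values.
credentials = {"user": "patron", "password": "s3cret"}

# GET over HTTP: the password becomes part of the URL itself, which is
# what access logs, browser history, and Referer headers record.
get_url = "http://example.org/login?" + urlencode(credentials)
leaked = parse_qs(urlsplit(get_url).query)

# POST over HTTPS: the URL carries no credentials; they travel in the
# request body, which is encrypted in transit and not part of the URL.
post_request = Request(
    "https://example.org/login",
    data=urlencode(credentials).encode("ascii"),
    method="POST",
)

print(get_url)                  # http://example.org/login?user=patron&password=s3cret
print(leaked["password"])       # ['s3cret']
print(post_request.full_url)    # https://example.org/login
```

Anything that records the first URL records the password along with it.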

So, to sum up, adoption of up-to-date security practices is far from complete in the world of library databases. I hope that the laggards have improved since the submission date of this RFP (roughly a year ago) or at least have plans in place to get with the program. I would welcome comments to this post that provide updates. Libraries themselves deserve a lot of the blame, because for the most part the vendors that serve them respond to their requirements and priorities.

I think libraries issuing RFPs for new systems and databases should include specific questions about security and privacy practices, and make sure that contracts properly assign liability for data breaches with the answers to these questions in mind.

Note: This post is based on information shared by concerned librarians on the LITA Patron Privacy Technologies Interest Group list. Join if you care about this.